paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Got Essential task `overseer` failed error after upgrading Kusama and Polkadot validator to v1.5.0 #2728

Open AlexZhenWang opened 9 months ago

AlexZhenWang commented 9 months ago

Description of bug

Hi, I am trying to upgrade both our Polkadot and Kusama validators to v1.5.0, but I get an Essential task `overseer` failed error after the upgrade.

After downgrading back to v1.4.0, the issue is gone. The error happens on both the Polkadot and the Kusama validator. The logs:

[screenshot of the error logs]

bkchr commented 9 months ago

Please provide more logs, and as text rather than a screenshot.

AlexZhenWang commented 9 months ago

Thanks for the response @bkchr. Here are the logs

2023-12-18 01:10:33 ----------------------------    
2023-12-18 01:10:33 This chain is not in any way    
2023-12-18 01:10:33       endorsed by the           
2023-12-18 01:10:33      KUSAMA FOUNDATION          
2023-12-18 01:10:33 ----------------------------    
2023-12-18 01:10:33 Parity Polkadot    
2023-12-18 01:10:33 ✌️  version 1.5.0-a3dc2f15f23    
2023-12-18 01:10:33 ❤️  by Parity Technologies <admin@parity.io>, 2017-2023    
2023-12-18 01:10:33 📋 Chain specification: Kusama    
2023-12-18 01:10:33 🏷  Node name: OnfinalityV#1    
2023-12-18 01:10:33 👤 Role: AUTHORITY    
2023-12-18 01:10:33 💾 Database: RocksDb at /chain-data/chains/ksmcc3/db/full    
2023-12-18 01:11:01 🏷  Local node identity is: 12D3KooWHcDiChr6KkEfgCQtSiLbTzcDRhyKHyqtMvn515QWrWWt    
2023-12-18 01:11:01 🔍 Discovered new external address for our node: /ip4/142.215.53.19/tcp/30000/p2p/12D3KooWHcDiChr6KkEfgCQtSiLbTzcDRhyKHyqtMvn515QWrWWt    
2023-12-18 01:11:08 🚀 Using prepare-worker binary at: "/usr/lib/polkadot/polkadot-prepare-worker"    
2023-12-18 01:11:08 🚀 Using execute-worker binary at: "/usr/lib/polkadot/polkadot-execute-worker"    
2023-12-18 01:11:09 💻 Operating system: linux    
2023-12-18 01:11:09 💻 CPU architecture: x86_64    
2023-12-18 01:11:09 💻 Target environment: gnu    
2023-12-18 01:11:09 💻 CPU: Intel Xeon Processor (Cascadelake)    
2023-12-18 01:11:09 💻 CPU cores: 16    
2023-12-18 01:11:09 💻 Memory: 32115MB    
2023-12-18 01:11:09 💻 Kernel: 5.4.0-89-generic    
2023-12-18 01:11:09 💻 Linux distribution: Ubuntu 22.04.3 LTS    
2023-12-18 01:11:09 💻 Virtual machine: yes    
2023-12-18 01:11:09 📦 Highest known block at #21034600    
2023-12-18 01:11:09 Running JSON-RPC server: addr=127.0.0.1:9900, allowed origins=["*"]    
2023-12-18 01:11:09 🏁 CPU score: 1.02 GiBs    
2023-12-18 01:11:09 🏁 Memory score: 4.26 GiBs    
2023-12-18 01:11:09 🏁 Disk score (seq. writes): 626.57 MiBs    
2023-12-18 01:11:09 🏁 Disk score (rand. writes): 327.96 MiBs    
2023-12-18 01:11:09 ⚠️  The hardware does not meet the minimal requirements Failed checks: Copy(expected: 11.49 GiBs, found: 4.26 GiBs), Seq Write(expected: 950.00 MiBs, found: 626.57 MiBs), Rnd Write(expected: 420.00 MiBs, found: 327.96 MiBs),  for role 'Authority' find out more at:
https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware    
2023-12-18 01:11:09 👶 Starting BABE Authorship worker    
2023-12-18 01:11:09 🚨 Your system cannot securely run a validator. 
Running validation of malicious PVF code has a higher risk of compromising this machine.
  - Cannot enable landlock, a Linux 5.13+ kernel security feature: not available: Could not fully enable: NotEnforced
  - Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and mount namespaces: Operation not permitted (os error 1)
You can ignore this error with the `--insecure-validator-i-know-what-i-do` command line argument if you understand and accept the risks of running insecurely. With this flag, security features are enabled on a best-effort basis, but not mandatory. 
More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode
2023-12-18 01:11:09 🥩 BEEFY gadget waiting for BEEFY pallet to become available...    
2023-12-18 01:11:09 subsystem exited with error subsystem="candidate-validation" err=FromOrigin { origin: "candidate-validation", source: Context("could not enable Secure Validator Mode; check logs") }
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="candidate-validation"
2023-12-18 01:11:09 subsystem finished unexpectedly subsystem=Ok(())
2023-12-18 01:11:09 Received `Conclude` signal, exiting
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="pvf-checker"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="bitfield-signing"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="network-bridge-rx"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="candidate-backing"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="network-bridge-tx"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="availability-recovery"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="runtime-api"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="collation-generation"
2023-12-18 01:11:09 received `Conclude` signal, exiting
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="gossip-support"
2023-12-18 01:11:09 Conclude
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="availability-store"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="provisioner"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="bitfield-distribution"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="dispute-distribution"
2023-12-18 01:11:09 received `Conclude` signal, exiting
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="chain-selection"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="chain-api"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="statement-distribution"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="approval-distribution"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="availability-distribution"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="collator-protocol"
2023-12-18 01:11:09 Terminating due to subsystem exit subsystem="approval-voting"
2023-12-18 01:11:09 discovered: 12D3KooWPEovdRYLyAw8phHWXLmJncnRYM6euSvPiahthcYKdeLv /ip4/172.17.0.1/tcp/30333    
2023-12-18 01:11:09 discovered: 12D3KooWPEovdRYLyAw8phHWXLmJncnRYM6euSvPiahthcYKdeLv /ip4/142.215.53.19/tcp/30333    
2023-12-18 01:11:10 subsystem exited with error subsystem="prospective-parachains" err=FromOrigin { origin: "prospective-parachains", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-12-18 01:11:10 Essential task `overseer` failed. Shutting down service.    
2023-12-18 01:11:10 subsystem exited with error subsystem="dispute-coordinator" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
Error: 
   0: Other: Essential task failed.

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                 ⋮ 1 frame hidden ⋮                               
   2: polkadot::main::h4cca9d3491727cb7
      at <unknown source file>:<unknown line>
   3: std::sys_common::backtrace::__rust_begin_short_backtrace::h94782a592969dae8
      at <unknown source file>:<unknown line>
   4: main<unknown>
      at <unknown source file>:<unknown line>
   5: __libc_start_main<unknown>
      at <unknown source file>:<unknown line>
   6: _start<unknown>
      at <unknown source file>:<unknown line>

Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
Run with RUST_BACKTRACE=full to include source snippets.

And here are the parameters that I used when running the validator: --chain=polkadot --base-path=/chain-data --rpc-cors=all --port=30333 --unsafe-rpc-external --node-key=<xxx> --rpc-methods=Unsafe --name=<name> --telemetry-url="wss://telemetry-backend.w3f.community/submit 1" --public-addr=/dns4/<xxx>/tcp/23739 --in-peers=100 --in-peers-light=0 --db-cache=512

alexggh commented 9 months ago

This is the root cause:

 - Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and mount namespaces: Operation not permitted (os error 1)
You can ignore this error with the `--insecure-validator-i-know-what-i-do` command line argument if you understand and accept the risks of running insecurely. With this flag, security features are enabled on a best-effort basis, but not mandatory. 
More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode
2023-12-18 01:11:09 🥩 BEEFY gadget waiting for BEEFY pallet to become available...    
2023-12-18 01:11:09 subsystem exited with error subsystem="candidate-validation" err=FromOrigin { origin: "candidate-validation", source: Context("could not enable Secure Validator Mode; check logs") }

I think you are hitting the same problem as here: https://github.com/paritytech/polkadot-sdk/issues/2662

mrcnski commented 9 months ago

Yes, looks like some new security features couldn't be enabled. "Operation not permitted" is interesting. Can you share details of your setup, is there anything unusual about it? Is your database path on a mount or have any special restrictions?

By the way, upgrading to Linux 5.13+ would make this part of the error go away, and the other part (unshare) becomes optional:

  - Cannot enable landlock, a Linux 5.13+ kernel security feature: not available: Could not fully enable: NotEnforced

If upgrading is not possible you can pass the CLI flag specified in the error.

AlexZhenWang commented 9 months ago

Thanks for the reply! @mrcnski

I am running the node as a pod in a k8s cluster. And the database is in a PVC that is mounted on the pod.

The related settings for the pod:

...
spec:
  containers:
  - args:
    - --base-path=/chain-data
...
    volumeMounts:
    - mountPath: /chain-data
      name: chaindata-volume
...
  volumes:
  - name: chaindata-volume
    persistentVolumeClaim:
      claimName: pvc-0
 ...

mrcnski commented 9 months ago

I am completely unfamiliar with kubernetes, but I presume the node is running in a container. That is probably why certain operations are not allowed, and maybe it depends on the container settings. For example if there is a seccomp sandbox it could be blocking the syscall, but I think this can be turned off. What is your Linux kernel version?
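One way to check the seccomp guess from inside the container is to look at the `Seccomp:` field that Linux exposes in `/proc/<pid>/status`. The snippet below is only a sketch for illustration: the helper name `seccomp_mode` is made up here, and it reports whether a filter is installed, not which syscalls the filter blocks.

```python
# Mode numbers as documented for the "Seccomp:" field of /proc/<pid>/status.
SECCOMP_MODES = {"0": "disabled", "1": "strict", "2": "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode found in a /proc/<pid>/status dump, or None.

    `seccomp_mode` is a hypothetical helper name, not part of polkadot.
    """
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "Seccomp":
            value = value.strip()
            return SECCOMP_MODES.get(value, value)
    return None  # field absent, e.g. a very old kernel
```

To use it, pass in the contents of `/proc/self/status` read from a process running in the same pod as the node. A result of `filter` would only mean that some seccomp profile is active, so the blocked syscall would still have to be confirmed separately.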

AlexZhenWang commented 9 months ago

Thanks @mrcnski. Yeah, I think you are right. Here is my Linux kernel version

5.4.0-89-generic

mrcnski commented 9 months ago

Thanks @AlexZhenWang! If it's possible, you can upgrade to Linux 5.13+; you'll still get a warning due to running in a container, but it won't be a hard error. Otherwise you'll need to pass --insecure-validator-i-know-what-i-do. Appreciate the report, it's important for us to know what is supported by actual configurations in production usage.
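For anyone who wants to script the 5.13+ check, here is a minimal sketch that compares a `uname -r` string against that threshold. The function name `kernel_at_least` is an assumption made for this example. A new enough version is necessary for landlock but probably not sufficient, since the feature also has to be enabled in the running kernel.

```python
import re


def kernel_at_least(release, minimum=(5, 13)):
    """Return True if a `uname -r` string is at least `minimum` (major, minor).

    `kernel_at_least` is a hypothetical helper; only the leading
    "major.minor" part of the release string is compared.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum
```

With the kernel reported in this issue, `kernel_at_least("5.4.0-89-generic")` is false. On a live machine the release string can be taken from `platform.release()`.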

bkchr commented 7 months ago

https://github.com/paritytech/polkadot-sdk/pull/2486#issuecomment-1953412955 posting this here as I think it belongs here.

CC @s0me0ne-unkn0wn @matthewmarcus

maksimryndin commented 7 months ago

@matthewmarcus Hi! Could you please provide an output/answer for the following commands/questions

  1. uname -a
  2. more detailed log output from the console, if possible (as text, like this: https://github.com/paritytech/polkadot-sdk/issues/2728#issuecomment-1859381980)
  3. in the terminal on the machine:
      echo "      #include <fcntl.h>
      #include <linux/landlock.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      int main (int argc, char *argv[]) {
        /* Get supported landlock ABI */
        int abi = syscall (SYS_landlock_create_ruleset, NULL, 0, LANDLOCK_CREATE_RULESET_VERSION);
        return abi < 0 ? 1 : 0;
      }
     " > landlock_test.c

      clang landlock_test.c -o landlock_test

      ./landlock_test

      echo $?    # what is the output? Is it 0?

Thank you!

s0me0ne-unkn0wn commented 7 months ago

Landlock is optional, anyway. The main problem to me is that unshare() fails, and clone() fails as a consequence. One obvious reason for failing unshare() is running the node inside a chroot environment or another type of jail. If that's not the case, I'm out of ideas right now, but I'll look into the kernel code a bit later to learn why it may fail. CC @koute just in case

matthewmarcus commented 7 months ago

Hey. Thanks for the reply.

Here are the results:

uname -r = 6.7.5-060705-generic

When attempting to run clang landlock_test.c -o landlock_test I get this error:

landlock_test.c:2:16: fatal error: 'linux/landlock.h' file not found
      #include <linux/landlock.h>
               ^~~~~~~~~~~~~~~~~~
1 error generated.

s0me0ne-unkn0wn commented 7 months ago

@matthewmarcus what Linux distribution is that, and what architecture are you running on?

matthewmarcus commented 7 months ago

@s0me0ne-unkn0wn

Ubuntu 20.04

Intel NUC w/ Core i7-8559U processor

s0me0ne-unkn0wn commented 7 months ago

@matthewmarcus Canonical hasn't officially released 6.7.5 kernel for 20.04 AFAIK. Do you use Mainline or some other kernel manager?

s0me0ne-unkn0wn commented 7 months ago

@matthewmarcus I haven't been using Ubuntu for quite some time now, but from what I googled quickly, for 20.04 the supported version in the HWE stack is 5.15, and the 6.7.5 most probably comes from mainline builds. The mainline builds are not supported, not guaranteed to work, and not recommended for production use. I don't say it's definitely a problem, but if you could try to run your node after booting from the officially supported 5.15 kernel from Ubuntu distro, you could probably save us a lot of debugging time :)

I personally run 6.7.0 from the Manjaro distro, and I don't have any issues with secure validator mode, but that's not exactly the same as the mainline builds.

maksimryndin commented 7 months ago

Yeah, I've spawned a test Ubuntu 20.04 amd64 machine, and the latest kernel available through apt is linux-image-5.8.0-63-lowlatency/focal-updates,focal-security 5.8.0-63.71~20.04.1 amd64. Perhaps it is some manually built unsigned image, and that's why secure boot is disabled? @matthewmarcus

s0me0ne-unkn0wn commented 7 months ago

@maksimryndin how about apt install --install-recommends linux-generic-hwe-20.04? In theory, it should get you 5.15, at least if you're on the latest 20.04.5 LTS update

maksimryndin commented 7 months ago

@s0me0ne-unkn0wn yeah, you're right :) 5.15. Exactly!

matthewmarcus commented 7 months ago

I used the ubuntu-mainline-kernel.sh script as described in this post:

https://askubuntu.com/questions/1388115/how-do-i-update-my-kernel-to-the-latest-one

matthewmarcus commented 7 months ago

Well, the kernel we were using prior to 6.7.5 was 5.15.0-88-generic and that was giving us the same errors (see https://github.com/paritytech/polkadot-sdk/pull/2486#issuecomment-1953412955). So unless one of the minor builds after 5.15.0-88 fixed the issue, the 5.15 kernel isn't working either.

matthewmarcus commented 7 months ago

That's interesting b/c we've never manually updated the kernel on this machine since originally installing Ubuntu 20.04, and the kernel it chose (?) to use was 5.15.0-88-generic. I did notice there were a boat load of other kernels on the box as well, but I removed them in an attempt to free up some disk space. Removing them, tho, did not free up any disk space. :) @maksimryndin

matthewmarcus commented 7 months ago

Just ran the lsb_release -a command on our box and it revealed:

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal

Sharing in case it helps with debugging. @s0me0ne-unkn0wn

Also, ran this command sudo apt install --install-recommends linux-generic-hwe-20.04 which resulted in these errors:

Reading package lists... Done
Building dependency tree
Reading state information... Done
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
 linux-generic-hwe-20.04 : Depends: linux-image-generic-hwe-20.04 (= 5.15.0.94.104~20.04.50) but it is not going to be installed
                           Depends: linux-headers-generic-hwe-20.04 (= 5.15.0.94.104~20.04.50) but it is not going to be installed
 linux-headers-6.7.5-060705-generic : Depends: libc6 (>= 2.38) but 2.31-0ubuntu9.14 is to be installed
                                      Depends: libssl3 (>= 3.0.0) but it is not installable
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).

matthewmarcus commented 7 months ago

Did I scare everyone away? @s0me0ne-unkn0wn @maksimryndin

s0me0ne-unkn0wn commented 7 months ago

@matthewmarcus what does the OS-suggested apt --fix-broken install --install-recommends linux-generic-hwe-20.04 propose? Does it look like it wants to break your system completely if applied? :upside_down_face:

Honestly, I'd try to install Ubuntu from scratch, not using mainline builds and using the supported HWE stack. There's nothing special about NUC hardware that might prevent proper sandboxing AFAIK (@koute ?) so that's most probably a kernel issue. If you're able to sort it out with --fix-broken, that's okay, but if the system is seriously messed up, it's sometimes just easier to re-install.

koute commented 7 months ago

Well, there can be a few reasons why this doesn't work, but AFAIK the usual reason is that the environment is configured to disallow unprivileged users from creating user namespaces. Some Linux distributions might be configured this way by default, and some containerization software (Docker/Podman/Kubernetes/insert your trendy alternative of the week) might also disallow it.

So the fundamental question here is: does this happen because of how the environment is configured, or does this happen because our code doesn't handle some corner case?

If it's the former, that's an unsupported configuration, and we should document this and tell the users how to fix it. (Ideally we could detect this exact situation and have the node print out a helpful error message.) If it's the latter, we need to fix our code.

Either way the fastest way to investigate and fix it is probably something like this:

1) Ask the user about their environment (which Linux distro, which exact kernel, bare metal or VM, is it Docker/Podman/Kubernetes/whatever and if it is the exact configuration, etc.).
2) Replicate the same environment ourselves in a VM.
3) See why it fails, and either document on how the users can reconfigure their environment or fix our code. (And if it doesn't fail we can then probe the user further to figure out what's different in their environment as opposed to ours, and then rinse and repeat.)

(At least that's what I would do.)
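As a starting point for the "ask about the environment" step, the sketch below reads two sysctls that commonly gate unprivileged user namespaces: `user.max_user_namespaces` (mainline kernels) and `kernel.unprivileged_userns_clone` (present only on some distribution kernels, such as Debian and Ubuntu ones). Both function names are assumptions for this example, and the list of knobs is not exhaustive; container runtimes and service managers can block namespaces without touching either value.

```python
from pathlib import Path

# sysctl name -> path below /proc/sys; not an exhaustive list.
USERNS_KNOBS = {
    "user.max_user_namespaces": "user/max_user_namespaces",
    "kernel.unprivileged_userns_clone": "kernel/unprivileged_userns_clone",
}


def read_userns_sysctls(proc_sys="/proc/sys"):
    """Return {sysctl name: int value, or None if missing or unreadable}."""
    values = {}
    for name, relative in USERNS_KNOBS.items():
        try:
            values[name] = int((Path(proc_sys) / relative).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def userns_blockers(values):
    """List the sysctls whose value (0) forbids unprivileged user namespaces."""
    return [name for name, value in values.items() if value == 0]
```

`userns_blockers(read_userns_sysctls())` returning an empty list does not prove that namespaces are allowed. It only rules out these two settings, which narrows down where to look next.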

matthewmarcus commented 7 months ago

@s0me0ne-unkn0wn I'm out of town at the moment, and don't want to issue that command until I return (this Wed) just in case it breaks the entire system. I'll let you know when I try and report back.

As for reinstalling Ubuntu, I have several nodes/systems running on this platform so doing that would be a real undertaking and result in significant down time. I would only want to do that as a last resort.

@maksimryndin has reached out and we're gonna look at the problem together in the coming days. If we figure anything out, we'll be sure to let you know.

maksimryndin commented 7 months ago

So, together with @matthewmarcus, we have figured out the actual reason: polkadot was running under an overly restrictive systemd unit configuration (which also disabled the ability to create namespaces). There is nothing special about the system itself.

Linux Good-KarMa 6.7.5-060705-generic #202402161836 SMP PREEMPT_DYNAMIC Fri Feb 16 19:10:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Matthew had created that restrictive systemd service long before the PVF security features were introduced. When he updated to a release that includes them, they couldn't be enabled.

We turned off systemd restrictions in favor of native polkadot security mechanisms and everything works as expected. @s0me0ne-unkn0wn @bkchr @koute

s0me0ne-unkn0wn commented 7 months ago

@maksimryndin oh wow, thanks a lot for investigating that! Can you please elaborate on what restrictions were in force? It's probably worth mentioning in the documentation to avoid other users hitting the problem.

matthewmarcus commented 7 months ago

Yes! Many many thanks to @maksimryndin for his excellent guidance and support today. We spent several hours attempting to figure out the issue only to find, as was mentioned, my systemd config for the service was much too restrictive.

@s0me0ne-unkn0wn

Here is a portion of the config file. As you can see, once we commented out all of the unnecessary parameters, the service worked perfectly.

[Service]
User=polkadot
Group=polkadot
Type=simple
Restart=always
RestartSec=120
# MemoryHigh=5400M
# MemoryMax=5500M
# CapabilityBoundingSet=
# LockPersonality=true
# NoNewPrivileges=true
# PrivateDevices=true
# PrivateMounts=true
# PrivateTmp=true
# PrivateUsers=true
# ProtectClock=true
# ProtectControlGroups=true
# ProtectHostname=true
# ProtectKernelModules=true
# ProtectKernelTunables=true
# ProtectSystem=strict
# ReadWritePaths=/media/maxdrive/polkadot
# RemoveIPC=true
# RestrictAddressFamilies=AF_INET AF_INET6 AF_NETLINK AF_UNIX
# RestrictNamespaces=true
# RestrictSUIDSGID=true
# SystemCallArchitectures=native
# SystemCallFilter=@system-service
# SystemCallFilter=landlock_add_rule landlock_create_ruleset landlock_restrict_self seccomp
# SystemCallFilter=@sandbox
# SystemCallFilter=@obsolete
# SystemCallFilter=seccomp
# SystemCallErrorNumber=EPERM
# SystemCallFilter=~@clock @module @mount @reboot @swap @privileged
# UMask=0027
maksimryndin commented 7 months ago

> @maksimryndin oh wow, thanks a lot for investigating that! Can you please elaborate on what restrictions were in force? It's probably worth mentioning in the documentation to avoid other users hitting the problem.

So I would advise users who hit a similar issue to check their systemd security-related settings and turn them off in favor of the native polkadot security features. I also believe we should come up with a standard template for troubleshooting this kind of thing (I can try to prepare a testing script and suggest a GitHub issue template).
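
As a starting point for such a check, here is a minimal Python sketch that lists which hardening directives are still active in a unit file. The function name and the directive set are assumptions made for illustration: the set is simply the list of directives from the [Service] block posted earlier in this thread, and the sketch does not claim to know which of them actually blocks the polkadot sandbox.

```python
# Hypothetical helper; not part of polkadot or systemd.
# Directive names are copied from the unit file posted earlier in this thread.
HARDENING_DIRECTIVES = {
    "CapabilityBoundingSet", "LockPersonality", "NoNewPrivileges",
    "PrivateDevices", "PrivateMounts", "PrivateTmp", "PrivateUsers",
    "ProtectClock", "ProtectControlGroups", "ProtectHostname",
    "ProtectKernelModules", "ProtectKernelTunables", "ProtectSystem",
    "RemoveIPC", "RestrictAddressFamilies", "RestrictNamespaces",
    "RestrictSUIDSGID", "SystemCallArchitectures", "SystemCallFilter",
    "SystemCallErrorNumber",
}


def active_hardening_directives(unit_text):
    """Return hardening directives that are set (not commented out).

    Names are returned once each, in order of first appearance.
    """
    found = []
    for raw_line in unit_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments ('#' or ';') and section headers like [Service].
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key in HARDENING_DIRECTIVES and key not in found:
            found.append(key)
    return found
```

Passing the contents of the unit file (for example, the text read from the polkadot.service file) returns the directives worth commenting out one at a time while testing. An empty list means none of the directives from that list are active.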

By the way, during our experiments (we tried to run zombienet first to avoid touching a running validator) we encountered an issue (filed here https://github.com/paritytech/zombienet/issues/1737).

So,

General-Beck commented 6 months ago

If a verification script is created, I would recommend the following order:

  1. Check the current kernel version with uname -r at the very beginning of the script (for Ubuntu releases 20.04.6 and 22.04.4 onward, kernel version 5.15.0-100 or higher is sufficient).

  2. If necessary, update the kernel to the latest HWE version.

     For 20.04.6:
       sudo apt --fix-broken install --install-recommends linux-generic-hwe-20.04

     For 22.04:
       sudo apt --fix-broken install --install-recommends linux-generic-hwe-22.04

  3. Check that Landlock is working correctly:

       root@beck-home-desktop:/home/denis# sudo dmesg | grep landlock || journalctl -kg landlock
       [0.547818] LSM: initializing lsm=lockdown,capability,landlock,yama,apparmor,integrity
       [0.547832] landlock: Up and running.

  4. Most likely, warnings such as this will remain at startup:

       Mar 11 08:14:44 beck-home-desktop polkadot[11531]: 2024-03-11 08:14:44 🚨 Some security issues have been detected.
       Mar 11 08:14:44 beck-home-desktop polkadot[11531]: Running validation of malicious PVF code has a higher risk of compromising this machine.
       Mar 11 08:14:44 beck-home-desktop polkadot[11531]: - Optional: Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and mount namespaces: Operation not allowed (os error 1)
       Mar 11 08:14:44 beck-home-desktop polkadot[11531]: - Optional: Cannot call clone with all sandboxing flags, a Linux-specific kernel security features: not available: could not clone, errno: EPERM: Operation not allowed
       Mar 11 08:14:44 beck-home-desktop polkadot[11531]: 2024-03-11 08:14:44 👮♀️ Running in Secure Validator Mode. It is highly recommended that you operate according to our security guidelines.
       Mar 11 08:14:44 beck-home-desktop polkadot[11531]: More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode

     In this case, you need to fix the systemd polkadot.service startup script, leaving only the following parameters (I think they are enough, since the binary file has its own strict verification):

       root@beck-home-desktop:/home/denis# nano /etc/systemd/system/multi-user.target.wants/polkadot.service

       [Unit]
       Description=Polkadot Node
       After=network.target
       Documentation=https://github.com/paritytech/polkadot

       [Service]
       EnvironmentFile=-/etc/default/polkadot
       ExecStart=/usr/bin/polkadot $POLKADOT_CLI_ARGS
       User=polkadot
       Group=polkadot
       Restart=always
       RestartSec=120

       [Install]
       WantedBy=multi-user.target

  5. Next, reload the systemd configuration and restart the service:

       root@beck-home-desktop:/home/denis# systemctl daemon-reload
       root@beck-home-desktop:/home/denis# systemctl restart polkadot
       root@beck-home-desktop:/home/denis# journalctl -fu polkadot
       Mar 11 08:23:56 beck-home-desktop polkadot[11804]: 2024-03-11 08:23:56 👶 Starting BABE Authorship worker
       Mar 11 08:23:56 beck-home-desktop polkadot[11804]: 2024-03-11 08:23:56 🥩 BEEFY gadget waiting for BEEFY pallet to become available...
       Mar 11 08:23:56 beck-home-desktop polkadot[11804]: 2024-03-11 08:23:56 👮♀️ Running in Secure Validator Mode. It is highly recommended that you operate according to our security guidelines.
       Mar 11 08:23:56 beck-home-desktop polkadot[11804]: More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode
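
Steps 1 and 3 of the list above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the 5.15.0-100 threshold is the Ubuntu-specific figure quoted in step 1 (the fourth number is Ubuntu's build suffix, so the comparison is not meaningful across distributions), and the Landlock check simply looks for the "landlock: Up and running" message shown in step 3.

```python
import re


def kernel_version_tuple(release):
    """Turn a `uname -r` string such as '5.15.0-100-generic' into (5, 15, 0, 100).

    Missing components are treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(?:-(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def kernel_at_least(release, minimum="5.15.0-100"):
    """Step 1: compare the running kernel against the threshold from this thread."""
    return kernel_version_tuple(release) >= kernel_version_tuple(minimum)


def landlock_running(kernel_log):
    """Step 3: look for the Landlock start-up message in dmesg/journalctl -k output."""
    return any("landlock: Up and running" in line for line in kernel_log.splitlines())
```

In a real script the release string would come from `platform.release()` (the same value `uname -r` prints), and the kernel log text from the output of the dmesg or journalctl command in step 3.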

Tested on Ubuntu 20.04.6, 22.04.4, and 23.10 with polkadot version 1.8.0-ec7817e5ad.
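
To confirm after the restart that the warnings from step 4 are gone, a script could scan the journal output for the "- Optional:" lines. The sketch below is a hypothetical helper that relies only on the wording quoted in step 4 for polkadot 1.8.0; other releases may phrase these messages differently.

```python
def optional_sandbox_warnings(journal_text):
    """Return the message part of every '- Optional: ...' line in journal output."""
    marker = "- Optional:"
    warnings = []
    for line in journal_text.splitlines():
        position = line.find(marker)
        if position != -1:
            # Keep only the text after the marker, without the journald prefix.
            warnings.append(line[position + len(marker):].strip())
    return warnings
```

An empty result for the output of journalctl -u polkadot suggests that no optional security feature was reported as unavailable at startup.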