sfxworks opened 1 year ago
It looks like it doesn't respect the kubelet CPU static policy.
hmm I'm not too familiar with cpu policies but it seems this may be true. @Abhinandan-Purkait? The io-engine tries to affinitize to the specified core list configured in the helm chart (the default from your chart would be taken from core-count, so 1,2 I think). Did you isolate cores 1 and 2? I wonder if that would sidestep the policy.
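In case it's useful, a sketch of what that would look like (assuming the chart exposes `io_engine.coreList`, as recent Mayastor charts do; the release and repo names here are illustrative):

```sh
# Isolate the target cores from the general scheduler first, e.g. by booting
# the host with isolcpus=1,2 on the kernel command line, then pin the
# io-engine data plane to those cores via the chart:
helm upgrade mayastor mayastor/mayastor -n mayastor \
  --set 'io_engine.coreList={1,2}'
```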
Hi, I am getting the same error on 2 servers, while the third one managed to start the pod:
```
[2023-08-10T02:11:01.133568146+00:00 INFO io_engine:io-engine.rs:179] Engine responsible for managing I/Os version 1.0.0, revision b0734db654d8 (v2.0.0)
[2023-08-10T02:11:01.133812452+00:00 INFO io_engine:io-engine.rs:158] free_pages 2MB: 1024 nr_pages 2MB: 1024
[2023-08-10T02:11:01.133829859+00:00 INFO io_engine:io-engine.rs:159] free_pages 1GB: 0 nr_pages 1GB: 0
[2023-08-10T02:11:01.134049851+00:00 INFO io_engine:io-engine.rs:211] kernel io_uring support: yes
[2023-08-10T02:11:01.134079945+00:00 INFO io_engine:io-engine.rs:215] kernel nvme initiator multipath support: yes
[2023-08-10T02:11:01.134165623+00:00 INFO io_engine::core::env:env.rs:791] loading mayastor config YAML file /var/local/io-engine/config.yaml
[2023-08-10T02:11:01.134191763+00:00 INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
[2023-08-10T02:11:01.134213488+00:00 INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2023-08-10T02:11:01.134239781+00:00 INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
[2023-08-10T02:11:01.134251732+00:00 INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
EAL: FATAL: Cannot set affinity
EAL: Cannot set affinity
thread 'main' panicked at 'Failed to init EAL', io-engine/src/core/env.rs:628:13
stack backtrace:
   0: std::panicking::begin_panic
   1: io_engine::core::env::MayastorEnvironment::init
   2: io_engine::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
I'm using microk8s and installed mayastor via the add-on. Kubernetes version is 1.27 with mayastor 2.0.0. The resources after creation:
```
~ ❯ kubectl get pod -n mayastor
NAME                                          READY   STATUS             RESTARTS      AGE
mayastor-csi-node-qwt5m                       2/2     Running            0             59m
mayastor-csi-node-l64bd                       2/2     Running            0             59m
etcd-wcckw7dkcs                               1/1     Running            0             58m
etcd-pcf79w5kxn                               1/1     Running            0             58m
mayastor-agent-core-f7ccf485-tzszv            1/1     Running            2 (57m ago)   59m
mayastor-operator-diskpool-5b4cfb555b-pht6l   1/1     Running            0             59m
mayastor-api-rest-bcb58d479-v7jm9             1/1     Running            0             59m
etcd-operator-mayastor-8574f998bc-q2z8z       1/1     Running            1 (55m ago)   59m
mayastor-csi-controller-6b867dd474-grwcw      3/3     Running            0             59m
mayastor-csi-node-m6ksd                       2/2     Running            4 (19m ago)   59m
etcd-s86jdxw5v8                               1/1     Running            2 (19m ago)   57m
mayastor-io-engine-9h6bg                      1/1     Running            2 (19m ago)   59m
mayastor-io-engine-bd8zz                      0/1     CrashLoopBackOff   5 (73s ago)   4m19s
mayastor-io-engine-szvcv                      0/1     CrashLoopBackOff   5 (50s ago)   4m6s
```
As you can see, 2 of the mayastor-io-engine pods are failing.
If it's not the core count, could the problem be that the CPU frequency is too low? The server that managed to start mayastor-io-engine runs at 3.0 GHz, while the 2 servers that failed have a lower-spec CPU running at 1.7 GHz. I would not want to change the CPUs right now, so is there another way?
How many CPU cores do these 2 servers have?
I have allocated 8 cores, 16 GB of RAM, and 64 GB of disk space on all 3 servers. I will try adding more cores (32) and will get back with the results.
Update
Added 32 cores to the LXC container running microk8s. Rebooted the container and added `RUST_BACKTRACE=full` to the mayastor-io-engine daemon set.
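(For reference, a one-liner that should do this; `kubectl set env` is standard kubectl, and the daemonset name is taken from the pod listing above:)

```sh
kubectl -n mayastor set env daemonset/mayastor-io-engine RUST_BACKTRACE=full
```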
Getting the same error:
```
[2023-08-10T18:58:35.477169774+00:00 INFO io_engine:io-engine.rs:179] Engine responsible for managing I/Os version 1.0.0, revision b0734db654d8 (v2.0.0)
[2023-08-10T18:58:35.477449869+00:00 INFO io_engine:io-engine.rs:158] free_pages 2MB: 1024 nr_pages 2MB: 1024
[2023-08-10T18:58:35.477467622+00:00 INFO io_engine:io-engine.rs:159] free_pages 1GB: 0 nr_pages 1GB: 0
[2023-08-10T18:58:35.477682164+00:00 INFO io_engine:io-engine.rs:211] kernel io_uring support: yes
[2023-08-10T18:58:35.477713263+00:00 INFO io_engine:io-engine.rs:215] kernel nvme initiator multipath support: yes
[2023-08-10T18:58:35.477806753+00:00 INFO io_engine::core::env:env.rs:791] loading mayastor config YAML file /var/local/io-engine/config.yaml
[2023-08-10T18:58:35.477831688+00:00 INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
[2023-08-10T18:58:35.477856564+00:00 INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2023-08-10T18:58:35.477875581+00:00 INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
[2023-08-10T18:58:35.477896816+00:00 INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
EAL: FATAL: Cannot set affinity
EAL: Cannot set affinity
thread 'main' panicked at 'Failed to init EAL', io-engine/src/core/env.rs:628:13
stack backtrace:
   0: 0x563edae8c63c - std::backtrace_rs::backtrace::libunwind::trace::h3fea1eb2e0ba2ac9
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
   1: 0x563edae8c63c - std::backtrace_rs::backtrace::trace_unsynchronized::h849d83492cbc0d59
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: 0x563edae8c63c - std::sys_common::backtrace::_print_fmt::he3179d37290f23d3
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:67:5
   3: 0x563edae8c63c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h140f6925cad14324
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:46:22
   4: 0x563edaeb3a8c - core::fmt::write::h31b9cd1bedd7ea38
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/fmt/mod.rs:1150:17
   5: 0x563edae85485 - std::io::Write::write_fmt::h1fdf66f83f70913e
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/io/mod.rs:1667:15
   6: 0x563edae8e670 - std::sys_common::backtrace::_print::he7ac492cd19c3189
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:49:5
   7: 0x563edae8e670 - std::sys_common::backtrace::print::hba20f8920229d8e8
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:36:9
   8: 0x563edae8e670 - std::panicking::default_hook::{{closure}}::h714d63979ae18678
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:210:50
   9: 0x563edae8e227 - std::panicking::default_hook::hf1afb64e69563ca8
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:227:9
  10: 0x563edae8ed24 - std::panicking::rust_panic_with_hook::h02231a501e274a13
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:624:17
  11: 0x563edaa4c865 - std::panicking::begin_panic::{{closure}}::h7a63bfeb662f20ad
  12: 0x563edaa4a5e4 - std::sys_common::backtrace::__rust_end_short_backtrace::h4247f61ed8ce89f4
  13: 0x563eda2db9fc - std::panicking::begin_panic::h2a5b2d5b2df0b927
  14: 0x563eda63ed57 - io_engine::core::env::MayastorEnvironment::init::h00d4823a049822b2
  15: 0x563eda5313ec - io_engine::main::hf80554fcb427d3c4
  16: 0x563eda568183 - std::sys_common::backtrace::__rust_begin_short_backtrace::h4ead7c1f369eb43e
  17: 0x563eda53ebed - std::rt::lang_start::{{closure}}::h58a35d1e00786750
  18: 0x563edae8f32a - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h2790017aba790142
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:259:13
  19: 0x563edae8f32a - std::panicking::try::do_call::hd5d0fbb7d2d2d85d
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:403:40
  20: 0x563edae8f32a - std::panicking::try::h675520ee37b0fdf7
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:367:19
  21: 0x563edae8f32a - std::panic::catch_unwind::h803430ea0284ce79
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panic.rs:129:14
  22: 0x563edae8f32a - std::rt::lang_start_internal::{{closure}}::h3a398a8154de3106
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/rt.rs:45:48
  23: 0x563edae8f32a - std::panicking::try::do_call::hf60f106700df94b2
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:403:40
  24: 0x563edae8f32a - std::panicking::try::hb2022d2bc87a9867
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:367:19
  25: 0x563edae8f32a - std::panic::catch_unwind::hbf801c9d61f0c2fb
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panic.rs:129:14
  26: 0x563edae8f32a - std::rt::lang_start_internal::hdd488b91dc742b96
        at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/rt.rs:45:20
  27: 0x563eda532e42 - main
  28: 0x7f3dac00eded - __libc_start_main
  29: 0x563eda2fdf2a - _start
        at /build/glibc-2.32/csu/../sysdeps/x86_64/start.S:120
  30: 0x0 - <unknown>
```
On the other server, which still has 8 cores, I get slightly different output:
```
[2023-08-10T19:04:14.476441862+00:00 INFO io_engine:io-engine.rs:179] Engine responsible for managing I/Os version 1.0.0, revision b0734db654d8 (v2.0.0)
[2023-08-10T19:04:14.476619998+00:00 INFO io_engine:io-engine.rs:158] free_pages 2MB: 1024 nr_pages 2MB: 1024
[2023-08-10T19:04:14.476630074+00:00 INFO io_engine:io-engine.rs:159] free_pages 1GB: 0 nr_pages 1GB: 0
[2023-08-10T19:04:14.476755343+00:00 INFO io_engine:io-engine.rs:211] kernel io_uring support: yes
[2023-08-10T19:04:14.476788992+00:00 INFO io_engine:io-engine.rs:215] kernel nvme initiator multipath support: yes
[2023-08-10T19:04:14.476839572+00:00 INFO io_engine::core::env:env.rs:791] loading mayastor config YAML file /var/local/io-engine/config.yaml
[2023-08-10T19:04:14.476854233+00:00 INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
[2023-08-10T19:04:14.476863175+00:00 INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2023-08-10T19:04:14.476872222+00:00 INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
[2023-08-10T19:04:14.476878751+00:00 INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
PANIC in rte_eal_init():
Cannot set affinity
11: [io-engine(+0x13af2a) [0x563c69519f2a]]
10: [/nix/store/sbbifs2ykc05inws26203h0xwcadnf0l-glibc-2.32-46/lib/libc.so.6(__libc_start_main+0xed) [0x7f802e1d1ded]]
9: [io-engine(+0x36fe42) [0x563c6974ee42]]
8: [io-engine(+0xccc32a) [0x563c6a0ab32a]]
7: [io-engine(+0x37bbed) [0x563c6975abed]]
6: [io-engine(+0x3a5183) [0x563c69784183]]
5: [io-engine(+0x36e3ec) [0x563c6974d3ec]]
4: [io-engine(+0x47ae78) [0x563c69859e78]]
3: [/nix/store/8lijpmw0rwja558780llanxmmvr572zi-io-engine/lib/libspdk-bundle.so(+0x915ee) [0x7f802e58c5ee]]
2: [/nix/store/8lijpmw0rwja558780llanxmmvr572zi-io-engine/lib/libspdk-bundle.so(__rte_panic+0xb6) [0x7f802e5880b9]]
1: [/nix/store/8lijpmw0rwja558780llanxmmvr572zi-io-engine/lib/libspdk-bundle.so(rte_dump_stack+0x1b) [0x7f80310abfab]]
```
@tiagolobocastro Any ideas?
Is there some kind of limit on your lxc container restricting it to a subset of your CPUs? Also, I noticed you're running v2.0.0; you could move to 2.3.0, though I suspect that won't help in this case.
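(If you want to double-check from the host, something like this should show any pinning; `lxc config get` is standard lxd CLI:)

```sh
# On the lxd host: empty output means no explicit CPU limit is set
lxc config get <container-name> limits.cpu
```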
I tried to install v2.3.0 from a chart and it did not help. There are no limits on the LXC container. I decided to upgrade the CPUs; if that helps I will post an update.
If it doesn't help, would you be able to change the io-engine container image to something else that would allow you to run this from the container:
`grep Cpus_allowed_list /proc/self/status`
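(One hypothetical way to do that, sketched under the assumption that io-engine is the first container in the mayastor-io-engine daemonset; adjust the index for your deployment and revert the patch afterwards:)

```sh
# Temporarily swap the io-engine container for busybox so the pod stays up
# long enough to inspect the cpuset the kubelet actually hands it.
kubectl -n mayastor patch daemonset mayastor-io-engine --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "busybox:1.36"},
  {"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["sleep", "3600"]},
  {"op": "add", "path": "/spec/template/spec/containers/0/args", "value": []}
]'
# Once the pod restarts:
kubectl -n mayastor exec <io-engine-pod> -- grep Cpus_allowed_list /proc/self/status
```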
Also, do you have a cpu manager policy of static?
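(To check that on the node, something like the following; both paths are assumptions for your setup, since plain kubelet reads its config file while microk8s keeps kubelet flags in the snap's args directory:)

```sh
# Plain kubelet: policy lives in the kubelet config file
grep -i cpuManagerPolicy /var/lib/kubelet/config.yaml
# microk8s: kubelet flags live in the snap args file
grep -i cpu-manager-policy /var/snap/microk8s/current/args/kubelet
```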
I've tested this with lxd, and when we limit an lxc container's CPUs, indeed I start to see the allowed CPU list being set up by lxc, for example:
```
root@ksnode-2:~# grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list:	2,9,12
```
In this case, to get the io-engine to run I had to change the cpu-list to those cores... I think we may need to tweak the io-engine dataplane CPU affinity for it to be more compatible with lxd and similar configurations.
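(A sketch of that workaround: `limits.cpu` with an explicit set is standard lxd pinning, while `io_engine.coreList` is assumed from recent charts, and the release/repo names are illustrative:)

```sh
# Pin the lxd container to specific host cores...
lxc config set ksnode-2 limits.cpu 2,9,12
# ...then point the io-engine data plane at that same set:
helm upgrade mayastor mayastor/mayastor -n mayastor \
  --set 'io_engine.coreList={2,9,12}'
```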
@tiagolobocastro, sorry I forgot to update. I have replaced the CPUs, but that did not resolve the issue.
I think one of the issues I have experienced with k8s in LXC and various storage solutions, including EBS, Ceph (CSI driver), and others, was the inability to mount a new drive inside the LXC container (even though it was privileged). I can't remember exactly why, but it seems like a limitation of LXD altogether. I did find a post regarding this...
I ultimately just installed bare-bones k8s directly on the server and most of those issues disappeared. I'm sure that if I tried to run OpenEBS there, it would work. So the issue is most likely related to running Kubernetes inside LXC.
**Describe the bug**
First install of mayastor, I'm getting a "Cannot set affinity" error.

**To Reproduce**
Steps to reproduce the behavior:

**Expected behavior**

**OS info (please complete the following information):**

**Additional context**
One node is fine. I also tried giving mayastor dedicated CPUs and running `helm upgrade`. This led to an etcd issue though.