openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes that is provisioned from an optimized NVME SPDK backend data storage stack.
Apache License 2.0

Mayastor 2.7.0 docker images will not start (`exec format error`) #1697

Closed adamcharnock closed 1 month ago

adamcharnock commented 1 month ago

Description

Mayastor 2.7.0 docker images fail with an `exec format error` when starting. This was not the case for the 2.6.1 images. The issue is present on only 1 of our 3 cluster nodes. All systems are identical, and all run Xeon CPUs.

To Reproduce

(prod) worker1 ~ ❱ ctr image pull docker.io/openebs/mayastor-io-engine:v2.7.0
docker.io/openebs/mayastor-io-engine:v2.7.0:                                      resolved       |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:b5633ab59f26e0e54870b779f7c6d0a6349bf12bc2c353d472f411d3028ebcc8: done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:f20fde36ee60f29d5a216d65c9ddd077b032c7959d9e06a3db1924bd55d517a8:    done           |++++++++++++++++++++++++++++++++++++++|
config-sha256:8d057d7d555c3319fd6fe5d467ecf54799773f8754be857d1871570fe0b82341:   done           |++++++++++++++++++++++++++++++++++++++|
elapsed: 0.8 s                                                                    total:   0.0 B (0.0 B/s)
unpacking linux/amd64 sha256:b5633ab59f26e0e54870b779f7c6d0a6349bf12bc2c353d472f411d3028ebcc8...
done: 6.168198ms
(prod) worker1 ~ ❱ ctr run docker.io/openebs/mayastor-io-engine:v2.7.0 test
exec /bin/io-engine: exec format error

I've only ever seen this error in relation to incompatible docker image architectures, but I cannot see how that could be the case here.
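One way to test the architecture theory directly is to read the first bytes of the binary the kernel refuses to execute. The following is a minimal sketch, not part of the original report: `describe_executable` and the `ELF_MACHINES` table are hypothetical names, and the path passed in would be wherever `/bin/io-engine` is visible on the host (for example inside an unpacked or mounted image rootfs, which is an assumption about your setup). It reports the ELF machine type and also flags empty or zero-filled files, which can produce the same `exec format error`.

```python
import struct

# Subset of e_machine values from the ELF specification.
ELF_MACHINES = {
    0x03: "x86",
    0x28: "arm",
    0x3E: "x86_64",
    0xB7: "aarch64",
    0xF3: "riscv",
}


def describe_executable(path):
    """Return a short description of what kind of file `path` appears to be."""
    with open(path, "rb") as f:
        header = f.read(20)  # e_ident (16 bytes) + e_type (2) + e_machine (2)

    if len(header) == 0:
        return "empty file"
    if header.count(0) == len(header):
        return "zero-filled header"
    if header[:2] == b"#!":
        return "script with interpreter line"
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return "not an ELF binary"

    # EI_DATA (byte 5 of e_ident): 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    name = ELF_MACHINES.get(machine, "unknown machine 0x%x" % machine)
    return "ELF for " + name
```

If this reports `ELF for x86_64` on the failing node, an architecture mismatch is unlikely; an empty or zero-filled result would point toward on-disk corruption instead.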

Expected behavior

Image should start as it did with the 2.6.1 image:

(prod) worker1 ~ ❱ ctr run docker.io/openebs/mayastor-io-engine:v2.6.1 test
[2024-07-17T19:12:38.298253713+00:00  INFO io_engine:io-engine.rs:242] Engine responsible for managing I/Os version 1.0.0, revision 58b7ecc18b2f (v2.6.1+0)
[2024-07-17T19:12:38.298351803+00:00  INFO io_engine:io-engine.rs:221] free_pages 2MB: 7729 nr_pages 2MB: 8192
... etc etc ...

OS Info (erroring system)

Linux worker1 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

OS Info (working systems)

Linux worker1 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Updates

Update 1: I removed the image, pruned, and re-pulled a fresh download. Same error, so it doesn't seem to be a corruption issue in the downloaded image.

Update 2: I rebooted a 5.15.0-113-generic node into kernel 5.15.0-116-generic and the container still starts there. So the kernel version is not the cause, and it looks like something is wrong with this first node specifically.

Update 3: Drained the affected node of pods and ran `nerdctl system prune --all`. The image still wouldn't start.

Update 4: Pulling in a developing sha from the docker registry instead. No idea what has happened to this system, but I think it is related to this containerd issue. My system did suffer an unexpected reboot (caused, I'm fairly sure, by Mayastor/nvme), which may be the cause. In any case, time to close this issue.
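If an unexpected reboot is suspected of leaving a damaged file behind, one generic check is to hash the same file on the affected node and on a healthy node and compare the results. This is a minimal sketch under that assumption; `sha256_of` is a hypothetical helper, and which file to hash (for example the unpacked `io-engine` binary) depends on where your runtime stores it.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running it against the same path on both nodes and comparing the two hex strings shows whether the files differ, without needing to know what the correct contents are.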

tiagolobocastro commented 1 month ago

Glad it's resolved, thanks for the Updates @adamcharnock