hashicorp / nomad-driver-podman

A nomad task driver plugin for sandboxing workloads in podman containers
https://developer.hashicorp.com/nomad/plugins/drivers/podman
Mozilla Public License 2.0

Nomad 1.7.x breaks support for ARM deployments #310

Closed: zyclonite closed this issue 5 months ago

zyclonite commented 6 months ago

After upgrading to 1.7.2, placement fails for the task. It seems that no matching CPU architecture is found; the reported reason is: Dimension cpu exhausted on 1 node

The instance I am using is a Graviton (AWS). Running the same test on AMD or Intel works.

I did not run the same test with the default Docker plugin, so this may be a general issue with the fingerprinting of the node. Comparing the old Nomad 1.6.5 with the new 1.7.2, I see differences on the CPU side.

OLD

image

NEW

image

Upgrading one node directly from 1.6.5 to 1.7.2 results in the following logs when trying to restart a task that was already running:

nomad[4205]:     2023-12-15T15:24:27.036Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=6ccd142f-d9c6-cc0c-9353-46ced346e5aa task=****
nomad[4205]:   error=
nomad[4205]:   | prestart hook "logmon" failed: Unrecognized remote plugin message: 
nomad[4205]:   | This usually means
nomad[4205]:   |   the plugin was not compiled for this architecture,
nomad[4205]:   |   the plugin is missing dynamic-link libraries necessary to run,
nomad[4205]:   |   the plugin is not executable by this process due to file permissions, or
nomad[4205]:   |   the plugin failed to negotiate the initial go-plugin protocol handshake
nomad[4205]:   | 
nomad[4205]:   | Additional notes about plugin:
nomad[4205]:   |   Path: /usr/bin/nomad
nomad[4205]:   |   Mode: -rwxr-xr-x
nomad[4205]:   |   Owner: 0 [root] (current: 0 [root])
nomad[4205]:   |   Group: 0 [root] (current: 0 [root])
nomad[4205]:   |   ELF architecture: EM_AARCH64 (current architecture: arm64)

some more details:

Procsiab commented 6 months ago

Hello there, I have a platform similar to the issue author's, but I am not able to reproduce the issue on Raspberry Pi hardware:

I am able to plan and run a job file through the CLI. Could this be related to the fact that I use Nomad directly to collect centralized logs?

zyclonite commented 6 months ago

It seems that, after the CPU fingerprinting refactoring, the problem was that the dmidecode package was missing on the host.

It would still be great if the fallback were 1000 MHz again when detection fails; that would keep the behaviour similar to the old version.
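
To illustrate the fallback being requested, here is a minimal Python sketch. It is not Nomad's code: the function names (`dmidecode_available`, `effective_cpu_mhz`, `total_compute_mhz`) are hypothetical, and treating a failed detection as `None` or a non-positive value is an assumption made for the example. The only values taken from this thread are the dmidecode dependency and the 1000 MHz fallback.

```python
import shutil

# Fallback speed requested in the comment above (1000 MHz).
FALLBACK_MHZ = 1000


def dmidecode_available() -> bool:
    """Return True if a dmidecode binary can be found on PATH."""
    return shutil.which("dmidecode") is not None


def effective_cpu_mhz(detected_mhz, fallback_mhz=FALLBACK_MHZ):
    """Return the detected core speed, or the fallback if detection failed.

    Assumption: a failed detection is represented as None or a
    non-positive number.
    """
    if detected_mhz is None or detected_mhz <= 0:
        return fallback_mhz
    return detected_mhz


def total_compute_mhz(cores, detected_mhz):
    """Total schedulable CPU: core count times effective core speed."""
    return cores * effective_cpu_mhz(detected_mhz)


if __name__ == "__main__":
    if not dmidecode_available():
        print("dmidecode not found on PATH; speed detection may fail")
    # With a failed detection, 4 cores still advertise 4 * 1000 MHz
    # instead of 0, so a job asking for CPU could still be placed.
    print(total_compute_mhz(4, None))
```

Under these assumptions, a node whose speed detection fails would still advertise a non-zero amount of CPU, which is presumably what prevented the "Dimension cpu exhausted" placement failure in the old version.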

apollo13 commented 5 months ago

I guess this can be closed here, since it would have to be fixed in nomad itself: https://github.com/hashicorp/nomad/issues/19412