Nomad 1.7.x breaks support for ARM deployments

zyclonite commented 6 months ago

After upgrading to 1.7.2, placement fails for the task. It seems that Nomad does not find a matching CPU architecture; the reported reason is: Dimension cpu exhausted on 1 node
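For context, a "Dimension ... exhausted" message generally points at resource capacity rather than at an architecture constraint. A minimal Python sketch of such a capacity check follows. The function name, the dictionary layout, and the numbers are hypothetical and are not taken from Nomad's source.

```python
from typing import Optional


def exhausted_dimension(capacity: dict, used: dict, ask: dict) -> Optional[str]:
    """Return the first resource dimension that cannot fit the ask, or None.

    Hypothetical simplification of a scheduler capacity check; not Nomad's code.
    """
    for dim, wanted in ask.items():
        free = capacity.get(dim, 0) - used.get(dim, 0)
        if wanted > free:
            return dim
    return None


# Illustrative numbers only: a node whose CPU capacity was fingerprinted as 0 MHz.
node = {"cpu": 0, "memory": 4096}
task = {"cpu": 100, "memory": 256}
print(exhausted_dimension(node, {}, task))  # cpu
```

Under this assumption, a node that reports no usable CPU capacity rejects every task that asks for any CPU at all, regardless of its architecture.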

The instance I am using is a Graviton (AWS). Doing the same test on AMD or Intel works.

I did not run the same test with the default Docker plugin, so it may be a general issue with the fingerprinting of the node. If I compare the old Nomad 1.6.5 with the new 1.7.2 branch, I see differences on the CPU side.

OLD

NEW

Upgrading one node directly from 1.6.5 to 1.7.2 results in the following logs when trying to restart a task that was already running:

nomad[4205]:     2023-12-15T15:24:27.036Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=6ccd142f-d9c6-cc0c-9353-46ced346e5aa task=****
nomad[4205]:   error=
nomad[4205]:   | prestart hook "logmon" failed: Unrecognized remote plugin message: 
nomad[4205]:   | This usually means
nomad[4205]:   |   the plugin was not compiled for this architecture,
nomad[4205]:   |   the plugin is missing dynamic-link libraries necessary to run,
nomad[4205]:   |   the plugin is not executable by this process due to file permissions, or
nomad[4205]:   |   the plugin failed to negotiate the initial go-plugin protocol handshake
nomad[4205]:   | 
nomad[4205]:   | Additional notes about plugin:
nomad[4205]:   |   Path: /usr/bin/nomad
nomad[4205]:   |   Mode: -rwxr-xr-x
nomad[4205]:   |   Owner: 0 [root] (current: 0 [root])
nomad[4205]:   |   Group: 0 [root] (current: 0 [root])
nomad[4205]:   |   ELF architecture: EM_AARCH64 (current architecture: arm64)

some more details:

podman version 4.7.2
cgroups v2

Procsiab commented 6 months ago

Hello there, I have a platform similar to the issue author's. However, I am not able to reproduce the issue on Raspberry Pi hardware:

OS: Fedora IoT 39.20231130.0
Kernel: 6.6.2-201.fc39.aarch64
Nomad: 1.7.2
Podman 4.7.2 rootless
CGroup V2

I am able to plan and run a job file through the CLI. Could this be related to the fact that I use Nomad directly to collect centralized logs?

zyclonite commented 6 months ago

It seems that, after the CPU fingerprinting refactoring, the problem was that the dmidecode package was missing on the host.
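A quick way to confirm whether the tool is present on a host is to look it up on PATH. This is only a sketch: the helper name `tool_path` is made up for illustration, and it checks nothing beyond the existence of the executable.

```python
import shutil
from typing import Optional


def tool_path(name: str = "dmidecode") -> Optional[str]:
    """Return the absolute path of an executable found on PATH, or None if missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = tool_path()
    if path is None:
        print("dmidecode not found on PATH")
    else:
        print(f"dmidecode found at {path}")
```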

It would still be great if the fallback were 1000 MHz again when the detection fails. That would keep the behaviour similar to the old version.
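The requested fallback could look roughly like the Python sketch below. It is not Nomad's implementation: the names `FALLBACK_MHZ` and `total_compute_mhz` are hypothetical, and applying the fallback per core (rather than per node) is an assumption.

```python
from typing import Optional

# Assumed default, taken from the 1000 MHz figure mentioned above.
FALLBACK_MHZ = 1000


def total_compute_mhz(cores: int, detected_mhz: Optional[int]) -> int:
    """Total compute in MHz, using FALLBACK_MHZ per core when detection failed."""
    if detected_mhz is None or detected_mhz <= 0:
        per_core = FALLBACK_MHZ  # detection failed: fall back instead of reporting 0
    else:
        per_core = detected_mhz
    return cores * per_core


print(total_compute_mhz(4, None))  # 4000
print(total_compute_mhz(4, 2600))  # 10400
```

With a fallback like this, a failed frequency detection would still leave the node with non-zero CPU capacity for placement.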

apollo13 commented 5 months ago

I guess this can be closed here, since it would have to be fixed in nomad itself: https://github.com/hashicorp/nomad/issues/19412

hashicorp / nomad-driver-podman

Nomad 1.7.x breaks support for ARM deployments #310