hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Servers fingerprint 0Mhz of CPU in 1.7.1 #19412

Closed the-maldridge closed 11 months ago

the-maldridge commented 11 months ago

Nomad version

Output from nomad version

$ nomad version
Nomad v1.7.1
Revision 608e719430038cdeb5fe108536d90cf88a8540e3

Operating system and Environment details

Void Linux on amd64. Virtualized on DigitalOcean.

Issue

After upgrading to 1.7.1 I noticed my controller proxies weren't running due to resource exhaustion. This was suspicious since they'd been just fine previously. On inspection, the Nomad servers are now all reporting 0 MHz of CPU available.

Reproduction steps

Upgrade a cluster from 1.6.3 to 1.7.1 with server and client enabled on the same machine. After the upgrade, the servers which run embedded clients fingerprint 0 MHz of CPU, whereas clients that are distinct are unaffected and operate as expected.

Expected Result

I expected to have CPU continue to be fingerprinted as it was previously.

Actual Result

There is no CPU fingerprinted on server nodes, so I now can't schedule tasks to them.
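A quick way to see the fingerprinted values on an affected node (attribute names are from memory, so treat as approximate):

$ nomad node status -self -verbose | grep -i cpu    # cpu.* attributes and the node's CPU resources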

Job file (if appropriate)

N/A

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

N/A

ahjohannessen commented 11 months ago

I think you need to have dmidecode installed.
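(A quick check, assuming the package is simply named dmidecode; on Void that would be something like:)

$ command -v dmidecode || sudo xbps-install -S dmidecode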

tgross commented 11 months ago

Potentially related: https://github.com/hashicorp/nomad/issues/19406

the-maldridge commented 11 months ago

When did dmidecode become an external dependency?

Settler commented 11 months ago

We have the same issue. We suspect it happens because dmidecode cannot be executed when Nomad runs as a regular user without root access. dmidecode requires root, so Nomad cannot read the output of this command in our production environment. Running dmidecode as the nomad user prints this:

dmidecode 3.3
/sys/firmware/dmi/tables/smbios_entry_point: Permission denied
Scanning /dev/mem for entry point.
/dev/mem: Permission denied

Update: We confirmed this is the cause. Nomad running with root access successfully reads the CPU frequency. So, the workaround is to run Nomad as root...
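A minimal way to confirm the permission difference, assuming a nomad system user and sudo are available:

$ sudo -u nomad dmidecode -t processor                      # fails: /dev/mem: Permission denied
$ sudo dmidecode -t processor | grep -i 'current speed'     # works as root, prints the detected MHz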

tgross commented 11 months ago

When did dmidecode become an external dependency?

dmidecode became the fallback for fingerprinting CPU with the improvements to CPU fingerprinting done in 1.7.0. We missed documenting this in the upgrade guide, and we've got an issue open to get it added to the packaging: https://github.com/hashicorp/nomad/issues/19382 (but I realize this doesn't help your environment, @the-maldridge, so that'll probably be a packaging change you'll want to make). Note that we've got another issue related to the new fingerprinting, https://github.com/hashicorp/nomad/issues/19417, so you might want to hold off on doing that until we've shipped the patch for it. We're planning on doing that ASAP.

So, the workaround is to run nomad under root...

Correct. This is an existing requirement for Nomad clients. See commands/agent, permissions, and/or hardening Nomad.
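(If you want to double-check which user the agent is running as — process name assumed to be nomad:)

$ ps -o user=,args= -C nomad    # clients should show root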

the-maldridge commented 11 months ago

@tgross what makes no sense then is why my environment is falling back. Some of these machines in my environment are VMs, and only some of them have dmidecode installed, yet only the servers hit this issue. It seems like whatever the primary method is bailed out, and then dmidecode failed.

tgross commented 11 months ago

We're still working on figuring that out. The /sys file system we're getting from some providers is giving us an unexpected set of cores, and that's triggering bugs in our NUMA detection logic. Sorry I don't have a better answer for you yet but we've got several people actively working on it.
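If it helps with repro reports, these are standard sysfs paths an affected VM can be checked against (not necessarily the exact files Nomad reads):

$ ls /sys/devices/system/node/                                # NUMA nodes, e.g. node0
$ cat /sys/devices/system/cpu/cpu0/topology/core_id
$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq   # kHz; often absent on VMs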

roman-vynar commented 11 months ago

Getting this too on 1.7.1:

    2023-12-12T10:38:43.953Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=327da011-6dfc-18c8-57bd-6e38e72051b3 task=jenkins-worker type=Received msg="Task received by client" failed=false
    2023-12-12T10:38:43.956Z [ERROR] client.alloc_runner: prerun failed: alloc_id=327da011-6dfc-18c8-57bd-6e38e72051b3 error="pre-run hook \"cpuparts_hook\" failed: open /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus: no such file or directory"
    2023-12-12T10:38:43.957Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=327da011-6dfc-18c8-57bd-6e38e72051b3 task=jenkins-worker type="Setup Failure" msg="failed to setup alloc: pre-run hook \"cpuparts_hook\" failed: open /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus: no such file or directory" failed=true

I am not sure why it is looking at /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus. The right path in my case is /sys/fs/cgroup/nomad.slice/cpuset.cpus, and that file is empty.

Moreover, dmidecode -t 4 runs fine, but somehow that's not enough for Nomad to figure out CPU compute units; it reports 0 because of the above, i.e. looking at a non-existent path.
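For anyone comparing, the relevant files can be inspected like this (standard cgroup v2 interface, not Nomad-specific):

$ cat /sys/fs/cgroup/cgroup.controllers          # should list cpuset
$ ls /sys/fs/cgroup/nomad.slice/
$ cat /sys/fs/cgroup/nomad.slice/cpuset.cpus     # empty usually means "inherit the parent's CPUs"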

caiodelgadonew commented 11 months ago

Seems it was introduced in 1.7.0. All servers still running CentOS 7 got this issue; AlmaLinux 9 servers were fine.

$ cat /etc/*release
CentOS Linux release 7.9.2009 (Core)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.9.2009 (Core)
CentOS Linux release 7.9.2009 (Core)

$ dmidecode -t 4
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0004, DMI type 4, 42 bytes
Processor Information
    Socket Designation: CPU 0
    Type: Central Processor
    Family: Xeon
    Manufacturer: Intel(R) Corporation
    ID: 57 06 05 00 FF FB EB BF
    Signature: Type 0, Family 6, Model 85, Stepping 7
    Flags:
        FPU (Floating-point unit on-chip)
        VME (Virtual mode extension)
        DE (Debugging extension)
        PSE (Page size extension)
        TSC (Time stamp counter)
        MSR (Model specific registers)
        PAE (Physical address extension)
        MCE (Machine check exception)
        CX8 (CMPXCHG8 instruction supported)
        APIC (On-chip APIC hardware supported)
        SEP (Fast system call)
        MTRR (Memory type range registers)
        PGE (Page global enable)
        MCA (Machine check architecture)
        CMOV (Conditional move instruction supported)
        PAT (Page attribute table)
        PSE-36 (36-bit page size extension)
        CLFSH (CLFLUSH instruction supported)
        DS (Debug store)
        ACPI (ACPI supported)
        MMX (MMX technology supported)
        FXSR (FXSAVE and FXSTOR instructions supported)
        SSE (Streaming SIMD extensions)
        SSE2 (Streaming SIMD extensions 2)
        SS (Self-snoop)
        HTT (Multi-threading)
        TM (Thermal monitor supported)
        PBE (Pending break enabled)
    Version: Intel(R) Xeon(R) Platinum 8223CL CPU @ 3.00GHz
    Voltage: 1.6 V
    External Clock: 100 MHz
    Max Speed: 3500 MHz
    Current Speed: 3000 MHz
    Status: Populated, Enabled
    Upgrade: Socket LGA3647-1
    L1 Cache Handle: 0x0005
    L2 Cache Handle: 0x0006
    L3 Cache Handle: 0x0007
    Serial Number: Not Specified
    Asset Tag: Not Specified
    Part Number: Not Specified
    Core Count: 8
    Core Enabled: 8
    Thread Count: 16
    Characteristics:
        64-bit capable
        Multi-Core
        Hardware Thread
        Execute Protection
FibreFoX commented 11 months ago

Upgrading from 1.6.3 to 1.7.1 results in no allocations being able to start anymore.

Running on a Raspberry Pi 4B on Raspbian OS Bookworm with the following 64-bit ARM64 kernel (customized; HugeTLB pages were enabled in addition):

Linux nomadnode 6.1.63-v8-huge+ #1 SMP PREEMPT Thu Nov 23 13:17:36 CET 2023 aarch64 GNU/Linux

Sadly, dmidecode returns nothing:

# dmidecode 3.4
# No SMBIOS nor DMI entry point found, sorry.

Yes, Nomad runs as root.

As already pointed out in https://github.com/hashicorp/nomad/issues/19412#issuecomment-1852450734, the path seems to be different (has it changed?):

root@nomadnode:/home/pi# ls -althr /sys/fs/cgroup/nomad.slice/cpuset.cpus
-rw-r--r-- 1 root root 0 Dec 13 12:50 /sys/fs/cgroup/nomad.slice/cpuset.cpus
root@nomadnode:/home/pi# ls -althr /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus
ls: cannot access '/sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus': No such file or directory
tgross commented 11 months ago

Ok, just a quick update / pointer for folks. Anyone running into the following log line:

open /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus: no such file or directory"

is experiencing a different bug than we're tracking in this issue. In this case cgroups are mounted but the cpuset controller is not enabled. In previous versions of Nomad we allowed such a configuration at the expense of not actually enforcing resource utilization, but in 1.7 it's mandatory.

There's some discussion about this happening in #19176. So if you're experiencing that problem, please take it over to that thread so that we can focus on the other problem we're seeing for folks who have all the expected cgroup controllers.
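(For completeness, a rough way to check whether cpuset is available and delegated; these are standard cgroup v2 files, and the last command is only a temporary test:)

$ cat /sys/fs/cgroup/cgroup.controllers        # controllers the root cgroup offers
$ cat /sys/fs/cgroup/cgroup.subtree_control    # controllers delegated to child cgroups
$ sudo sh -c 'echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control'   # enable delegation for testing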

tgross commented 11 months ago

Hi folks! Just wanted to give an update on this and the related set of bugs around panics and fingerprinting (#19407, #19372, #19412). @pkazmierczak @shoenig @lgfa29 and I have been working on trying to get a reproduction of all these issues and they turn out to be interrelated and it depends a bit on your particular environment which one you're going to hit.

Reliable and accurate CPU fingerprinting is highly platform specific. In Nomad 1.7.0 as part of work for NUMA support (and to reduce other problems in our CPU accounting), we introduced a new CPU fingerprinting scheme for Linux, where we go through a series of "scanner" options (ref detect_linux.go#L20-L27) of decreasing accuracy until we get a valid CPU topology. Unfortunately there are a lot of platforms where this scanning is still coming up with no valid topology.

Depending on the shape of your particular environment, you might end up with 0 cpu reported. Or you might hit the code path where the scheduler tries to read the nil CPU topology and you get a panic.

Our fix for this is going to be to reintroduce the less accurate pre-1.7.0 fingerprinting as a final fallback for those platforms where we can't make sense of the CPU topology. This means NUMA support won't work in those environments, but largely those platforms appear to be those where NUMA isn't meaningful (ex. containers, hypervisors that cut up a NUMA core between VMs, etc.).
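(Roughly speaking, that older fallback amounts to reading /proc/cpuinfo and the core count; illustrative only, not the exact code path:)

$ grep -m1 'cpu MHz' /proc/cpuinfo
$ nproc    # total usable cores/threads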

We're working on the patch for this now and will ship it as soon as it's available. Thanks for all the detailed reports, and for your patience while we work through this.

MorphBonehunter commented 11 months ago

Maybe another suggestion regarding the dmidecode stuff, @tgross.

In virtual environments, at least KVM with QEMU virtual CPU models (the default in Proxmox or oVirt, for example), dmidecode exposes the wrong MHz, while lscpu and /proc/cpuinfo expose much better values.

So for example, in my lab environment I have 2x 1097 MHz (old CPU), which was detected fine before the upgrade to 1.7.1. After that the frequency was 0 MHz until I installed dmidecode in the VM. Now I have 2x 2000 MHz because of the dmidecode reporting (while lscpu -Je=MHZ and grep -i mhz /proc/cpuinfo show the real numbers).
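For comparison, the commands side by side (the MHZ column needs a reasonably recent util-linux):

$ lscpu -e=MHZ                                        # real per-CPU MHz from the kernel
$ grep -i '^cpu mhz' /proc/cpuinfo
$ dmidecode -t processor | grep -i 'current speed'    # the (wrong) 2000 MHz reported here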

Yes... maybe that's a corner case with my old hardware, but maybe you could think about it while you reintegrate the old fingerprinting and drop the dmidecode stuff ;)

shoenig commented 11 months ago

Hi @MorphBonehunter if you could file a separate issue, that would be helpful. While I believe we now have a fix in place for the undetected case in this thread, resolving a best source of truth would be better discussed separately.

tgross commented 11 months ago

Our patches for this issue have landed and will be shipped in Nomad 1.7.2 shortly (likely tomorrow).

tgross commented 11 months ago

Nomad 1.7.2 has shipped. We've got #19468 open as follow-up for whether the fingerprinting can be improved and made more accurate. I'm going to close this issue out as shipped. If you're still seeing this after 1.7.2, please let us know and we can reopen. Thanks!

FelipeLopes-systematica commented 10 months ago

I believe this should be reopened as it's still happening in Nomad 1.7.2: failed to setup alloc: pre-run hook "cpuparts_hook" failed: open /sys/fs/cgroup/cpuset/nomad/share/cpuset.cpus: no such file or directory

FibreFoX commented 10 months ago

@FelipeLopes-systematica have you checked https://github.com/hashicorp/nomad/issues/19481 ? I reported that issue running Nomad on a Raspberry Pi. Maybe it fits your problem too.

FelipeLopes-systematica commented 10 months ago

@FelipeLopes-systematica have you checked #19481 ? I reported that issue running Nomad on a Raspberry Pi. Maybe it fits your problem too.

Thanks @FibreFoX, are you saying that I might be missing enabling memory and cpu in my cgroup config? I'm running Nomad on WSL at the moment, Debian-based.

mount | grep cgroup2
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
FibreFoX commented 10 months ago

@FelipeLopes-systematica With WSL you probably are not able to change these... your problem is probably not related to this issue at all.

I've checked WSL, but you probably won't be able to really use it:
cat /sys/fs/cgroup/cgroup.controllers -> cat: /sys/fs/cgroup/cgroup.controllers: No such file or directory
cat /sys/fs/cgroup/unified/cgroup.controllers -> empty

Please open a new issue for this, as it really is something different.

FelipeLopes-systematica commented 10 months ago

I'm just wondering why this was working fine in Nomad 1.6. In my case, I'm sharing the host's cgroup with all Docker containers to simulate a multi-DC cluster locally, and that worked perfectly before the upgrade.