Available CPU MHz Varying Wildly for Same Instance Type

herter4171 commented 4 years ago

Nomad version

Nomad v0.10.4 (f750636ca68e17dcd2445c1ab9c5a34f9ac69345)

Operating system and Environment details

Amazon Linux 2 with a fixed head node and an auto-scaling group of c5.24xlarge instances, with scaling driven by Nomad state using a custom cloud metric.

Issue

The number of MHz available on a node varies wildly. For the exact same instance type (96 cores, 3 GHz stock, 3.9 GHz max), I'm seeing as low as 1.6E5 MHz all the way up to 3.4E5 MHz. Just now, I've launched 3 c5.24xlarge nodes, and their max MHz are

305184
280128
340800

I'd rather not hard-wire cpu_total_compute in the client config, and everything else I've read claims Nomad sets the MHz based on core count multiplied by rated clock speed rather than current.

Having MHz vary like this causes jobs to not be placed, even when the node actually has the capacity. Would a short-term fix be forcing all but one core to 100%, launching the Nomad client, and taking the load off of CPU? The docs I've read claim Nomad uses stock clock speed, so I'm kind of at a loss here.
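As a sanity check on the three totals above, here is a minimal sketch that divides each one by the 96 logical cores. The helper name `per_core_mhz` is made up for illustration. Each total divides evenly (3179, 2918 and 3550 MHz per core), which suggests the total is the core count multiplied by a single per-core reading rather than by the rated 3000 MHz.

```shell
#!/bin/sh
# Hypothetical helper: divide a node's reported total MHz by its logical
# core count to recover the per-core figure behind the total.
per_core_mhz() {
    total_mhz=$1
    cores=$2
    echo $((total_mhz / cores))
}

# The three totals reported above, each for a 96-core c5.24xlarge.
for total in 305184 280128 340800; do
    echo "$total MHz / 96 cores = $(per_core_mhz "$total" 96) MHz per core"
done

# For comparison: 96 cores at the stock 3000 MHz.
echo "rated total: $((96 * 3000)) MHz"
```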

Reproduction steps

Launch a few instances of the same type with the Nomad client running on boot (I'm using systemctl). Rated MHz for each client in the web UI should vary appreciably.

jrasell commented 4 years ago

Hi @herter4171 and thanks for the detail in this issue. In order to help diagnose this problem would you be able to provide the output of the following two commands from a couple of the instances where you are seeing this behaviour?

cat /proc/cpuinfo | grep 'cpu MHz'
cat /proc/cpuinfo | grep 'cpu cores'

shoenig commented 4 years ago

Seems like parsing cpu MHz out of /proc/cpuinfo is only going to get us current clockspeed, which could vary widely given power states, etc. Has nomad always determined clock speed this way? We should be getting the rated speed, instead. e.g.

$ lscpu | grep MHz
CPU MHz:                         3899.997
CPU max MHz:                     4700.0000
CPU min MHz:                     400.0000

herter4171 commented 4 years ago

Hi @jrasell, thank you for the response! The output of those grep commands is a bit lengthy because there are 96 cores. Here is some truncated output.

For the first instance,

$ cat /proc/cpuinfo | grep 'cpu MHz' | head -n 1
cpu MHz         : 1843.994
$ cat /proc/cpuinfo |grep 'cpu cores' | head -n 1
cpu cores       : 24

For the second instance,

$ cat /proc/cpuinfo | grep 'cpu MHz' | head -n 1
cpu MHz         : 1677.167
$ cat /proc/cpuinfo |grep 'cpu cores' | head -n 1
cpu cores       : 24

For the third instance,

$ cat /proc/cpuinfo | grep 'cpu MHz' | head -n 1
cpu MHz         : 1506.577
$ cat /proc/cpuinfo |grep 'cpu cores' | head -n 1
cpu cores       : 24

Given the nproc output, I'm guessing the "cpu cores" output of 24 implies there are four physical processors.

$ nproc
96
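Since the full output runs to 96 entries, a summary may be easier to share. The sketch below is only an illustration: the function names are hypothetical, and it assumes the x86 `/proc/cpuinfo` field names `physical id` and `cpu MHz`. Note that `nproc` counts logical CPUs, so 96 / 24 = 4 could also mean two sockets with two hardware threads per core; counting distinct `physical id` values would settle that.

```shell
#!/bin/sh
# Sketch: summarize a cpuinfo-format file instead of printing every entry.
# The file path is a parameter so the functions can be tried on a saved copy.

# Count distinct "physical id" values, i.e. CPU sockets.
count_sockets() {
    awk -F': *' '/^physical id/ { seen[$2] = 1 }
                 END { n = 0; for (id in seen) n++; print n }' "$1"
}

# Print the lowest and highest "cpu MHz" values across all logical CPUs.
mhz_range() {
    awk -F': *' '/^cpu MHz/ {
                     v = $2 + 0
                     if (count == 0 || v < min) min = v
                     if (count == 0 || v > max) max = v
                     count++
                 }
                 END {
                     if (count) printf "min=%.3f max=%.3f cpus=%d\n", min, max, count
                 }' "$1"
}
```

On a live node this would be called as `count_sockets /proc/cpuinfo` and `mhz_range /proc/cpuinfo`.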

dvusboy commented 4 years ago

Nomad uses gopsutil.cpu.InfoStat to get the CPU MHz. On Linux, gopsutil reads /sys/devices/system/cpu/cpuN/cpufreq/cpuinfo_max_freq by default to determine the maximum frequency of the CPU, and falls back on the value from /proc/cpuinfo if that read fails. You should check that sysfs path on your VMs, @herter4171.
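The lookup order described above can be sketched in shell as follows. This is not gopsutil's actual code, only an illustration of the fallback; `detect_cpu_mhz` is a hypothetical name, and both paths are parameters so the behaviour can be tried without touching the real /sys or /proc. The sysfs file reports kHz, hence the division by 1000.

```shell
#!/bin/sh
# Sketch of the lookup order: prefer the sysfs maximum, otherwise fall
# back to the current speed from a cpuinfo-format file.
detect_cpu_mhz() {
    cpu_dir=$1        # e.g. /sys/devices/system/cpu/cpu0
    cpuinfo=$2        # e.g. /proc/cpuinfo
    max_freq_file="$cpu_dir/cpufreq/cpuinfo_max_freq"

    if [ -r "$max_freq_file" ]; then
        # sysfs reports kHz; convert to MHz.
        khz=$(cat "$max_freq_file")
        echo "max $((khz / 1000))"
    else
        # Fallback: first "cpu MHz" entry, which is the *current* speed.
        awk -F': *' '/^cpu MHz/ { printf "current %.0f\n", $2; exit }' "$cpuinfo"
    fi
}
```

Running `detect_cpu_mhz /sys/devices/system/cpu/cpu0 /proc/cpuinfo` on a node shows which branch applies: a `current` result means the reading depends on whatever the clock happened to be at that moment.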

herter4171 commented 4 years ago

Hi @dvusboy, the lay of the land is that I'm using Amazon Linux 2 pretty much out of the box. That platform has /sys/devices/system/cpu, and from there it's cpu0 and so on. The subdirectories for cpu* don't have a cpufreq directory, so I'm not sure how to proceed. Is there something I can do to populate that? This seems like a pretty major detail for supporting Nomad on Amazon Linux 2, and I'd like to avoid switching distros.

dvusboy commented 4 years ago

@herter4171 By cpuN, I meant, substituting N with some non-negative integer. Since cpufreq is not there, I'd say you don't have access to the actual maximum frequency, and gopsutil defaults to MHz out of cpuinfo, which is the current frequency. It would explain what you're seeing.

herter4171 commented 4 years ago

@dvusboy, I belatedly picked up on that and edited my last comment accordingly. Can I do something to make Amazon Linux 2 play ball for Nomad, or can something be done on the Nomad side of things to fix this? One idea I have is spawning yes > /dev/null & for all but one core before launching the Nomad client to make Nomad recognize actual MHz, but I'd really appreciate some support for the given platform. I can't be the only guy running Nomad on Amazon Linux 2, after all.

dvusboy commented 4 years ago

I suppose you can use cpu_total_compute in the client configuration to override the fingerprinted values.

herter4171 commented 4 years ago

@dvusboy, I'm aware of that option, and I don't think it addresses the core issue. Nomad should be capable enough to set available MHz.

herter4171 commented 4 years ago

Hi @dvusboy and company, after rooting around a bit, I can see the difficulty in getting rated clock speed on Amazon Linux 2 without assumed access to sudo. In case it helps on your end, what I've put in place for initializing a Nomad client is as follows.

# Get max rated core speed (dmidecode reads the DMI tables, so it needs root)
CORE_MAX_MHZ=$(sudo dmidecode -t processor \
    | grep '^\s*Max Speed' \
    | head -n 1 \
    | awk '{print $3}')

# Multiply by number of logical cores to get total MHz
TOTAL_MHZ=$((CORE_MAX_MHZ * $(nproc)))

I'd still like to see this functionality become native instead of depending on my hacky Bash, but I'm equipped to move on if there's not interest in pursuing this. Thanks for the help so far.
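For anyone following along, a total computed this way still has to reach the client configuration through `cpu_total_compute`. Below is a minimal sketch of that last step; `render_client_config` is a hypothetical helper, and the per-core speed and core count are passed in explicitly so the arithmetic is visible.

```shell
#!/bin/sh
# Sketch: turn a per-core MHz figure and a core count into a Nomad client
# stanza that overrides the fingerprinted CPU total.
render_client_config() {
    core_mhz=$1
    cores=$2
    total_mhz=$((core_mhz * cores))
    cat <<EOF
client {
  enabled           = true
  cpu_total_compute = $total_mhz
}
EOF
}

# Example: 96 logical cores at a rated 3000 MHz.
render_client_config 3000 96
```

The output would be redirected into a file in whatever directory the Nomad agent loads its configuration from; that location depends on how the agent is started and is not specified in this thread.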

herter4171 commented 4 years ago

Hey @shoenig, I'm having a bit of additional difficulty in spite of my fix. Even though I've set the client stanza like I described and verified the updated value is reflected in Nomad, jobs still fail to be placed due to this other hidden limit shown in my screenshot. I'm a bit confused, because 262144 MHz / 96 cores = 2.73 GHz/core, and that's above the rated speed of 2.5 GHz and well below the max of 3.5 GHz. I'd hope to be able to move on with things, but this is still holding things back, I'm afraid.

shoenig commented 4 years ago

I'm thinking this is actually a problem on all EC2 instances, not just Amazon Linux 2. On an Ubuntu micro:

ubuntu@ip-172-31-82-121:~$ cpupower frequency-info
analyzing CPU 0:
  no or unknown cpufreq driver is active on this CPU
  CPUs which run at the same hardware frequency: Not Available
  CPUs which need to have their frequency coordinated by software: Not Available
  maximum transition latency:  Cannot determine or is not supported.
Not Available
  available cpufreq governors: Not Available
  Unable to determine current policy
  current CPU frequency: Unable to call hardware
  current CPU frequency:  Unable to call to kernel
  boost state support:
    Supported: no
    Active: no

ubuntu@ip-172-31-82-121:~$ # there is no cpufreq/cpuinfo_max_freq
ubuntu@ip-172-31-82-121:~$ ls /sys/devices/system/cpu/cpu0
cache  crash_notes  crash_notes_size  driver  firmware_node  hotplug  node0  power  subsystem  topology  uevent
ubuntu@ip-172-31-82-121:~$ ls /sys/devices/system/cpu/cpufreq  # empty

If there's any good news, the CPU cgroup management seems unaffected

Allocated Resources
CPU           Memory          Disk
250/2400 MHz  32 MiB/983 MiB  300 MiB/6.6 GiB

Allocation Resource Utilization
CPU         Memory
0/2400 MHz  388 KiB/983 MiB

Host Resource Utilization
CPU            Memory           Disk
2400/2400 MHz  146 MiB/983 MiB  1.4 GiB/8.0 GiB  # loaded deliberately

[ec2-user@ip-172-31-94-218 proc]$ cat /proc/cgroups
#subsys_name    hierarchy   num_cgroups enabled
cpuset  11  3   1
cpu 9   3   1
cpuacct 9   3   1
blkio   10  3   1
memory  6   3   1
devices 5   25  1
freezer 4   3   1
net_cls 2   3   1
perf_event  8   3   1
net_prio    2   3   1
hugetlb 7   3   1
pids    3   3   1

I'm going to keep researching and asking around, but I suspect this may boil down to parsing the rated CPU speed out of the CPU model name string. Hacky as that may be, it should be more accurate than parsing cpu MHz, which is tantamount to using a random number.
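A rough sketch of what parsing the model name could look like is below. It assumes the name ends in an "@ <n>GHz" suffix, as many Intel model strings do; the function name and the sample strings in the usage note are illustrative only. When the suffix is missing, the function prints nothing and returns non-zero, so a caller would still need another source for the rated speed.

```shell
#!/bin/sh
# Sketch: pull the rated speed out of a CPU model name string, if present.
# mhz_from_model_name is a hypothetical helper name.
mhz_from_model_name() {
    # Capture the number between "@" and "GHz"; empty when there is no match.
    ghz=$(printf '%s\n' "$1" | sed -n 's/.*@ *\([0-9][0-9.]*\) *GHz.*/\1/p')
    [ -n "$ghz" ] || return 1
    # Convert GHz to MHz, rounding to the nearest integer.
    awk -v g="$ghz" 'BEGIN { printf "%.0f\n", g * 1000 }'
}
```

For example, `mhz_from_model_name "Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz"` prints 3000, while a name with no frequency in it, such as "AMD EPYC 7571", produces no output.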

herter4171 commented 4 years ago

Hey @shoenig, thanks for the digging. One thing about using model name I've noticed is that certain instance types, like "memory optimized," use AMD chips that don't have the rated frequency in the name like Intel procs tend to. Also, I think the driver error I'm seeing in the pic from my last comment is related to this issue, since it's requiring a value for MHz between rated and max. I'd be happy to open a separate thread for that if it's going to muddy waters here, though.

shoenig commented 4 years ago

Another possibility might be to modify gopsutil to briefly load a single CPU thread and take measurements of the current speed, the maximum of which would be presumed to be the max CPU speed.

I put together a quick demo to check if this works, before submitting the idea upstream

$ for i in {1..10}; do ./loadcpu && sleep 3 && echo ""; done
read current speed: 800.04
loaded max speed:   3900.70

read current speed: 1924.65
loaded max speed:   3901.08

read current speed: 1495.16
loaded max speed:   3900.33

read current speed: 2826.81
loaded max speed:   3900.00

read current speed: 3400.18
loaded max speed:   3902.43

read current speed: 1979.91
loaded max speed:   3900.95

read current speed: 2627.13
loaded max speed:   3900.19

read current speed: 889.96
loaded max speed:   3901.62

read current speed: 3391.65
loaded max speed:   3902.97

read current speed: 906.17
loaded max speed:   3900.63
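The source of the loadcpu demo is not shown in this thread, so the following is only a shell sketch of the same idea under stated assumptions: spin one busy loop, sample the `cpu MHz` entries a few times, and keep the highest reading. Function names and parameters are hypothetical, and the cpuinfo path is a parameter so the logic can be tried on a saved file.

```shell
#!/bin/sh
# Sketch of "briefly load one CPU and keep the highest reading".

# Highest "cpu MHz" value currently listed in a cpuinfo-format file.
read_max_mhz() {
    awk -F': *' '/^cpu MHz/ { if ($2 + 0 > max) max = $2 + 0 }
                 END { printf "%.2f\n", max }' "$1"
}

# Spin one busy loop, sample the file N times, print the peak reading.
# Runs in a subshell so its variables and background job stay local.
loaded_max_mhz() (
    cpuinfo=$1; samples=$2; interval=$3
    ( while :; do :; done ) >/dev/null 2>&1 &
    spinner=$!
    peak=0
    i=0
    while [ "$i" -lt "$samples" ]; do
        sleep "$interval"
        now=$(read_max_mhz "$cpuinfo")
        peak=$(awk -v a="$peak" -v b="$now" \
            'BEGIN { printf "%.2f\n", (b > a) ? b : a }')
        i=$((i + 1))
    done
    # Stop the busy loop and reap it; ignore its non-zero exit status.
    kill "$spinner" 2>/dev/null || true
    wait "$spinner" 2>/dev/null || true
    echo "$peak"
)
```

A call such as `loaded_max_mhz /proc/cpuinfo 10 0.1` would take ten samples a tenth of a second apart; fractional `sleep` intervals are an assumption that holds for GNU coreutils but not for every system.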

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
