hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.8k stars 1.94k forks source link

Fix ARM64 CPU monitoring of jobs #17577

Closed TrueBrain closed 1 year ago

TrueBrain commented 1 year ago

Proposal

There are many tickets on this issue-tracker that explains that the cpu_total_compute doesn't work on ARM64. This ticket is not about that. It is however about the fact that, even if you fill in that value manually, or run on AWS where it is done for you, jobs and clients still show 0% CPU usage. And I think this is just an oversight, and can be fixed.

Use-cases

Not having CPU overview in the UI, and not receiving ticks via Prometheus, is just slightly annoying. Although the percentage works fine on Prometheus, and they are basically the same value, it is just nice to have ARM64 on-par with other platforms.

Attempted Solutions

In short, the problem seems to be in at what place cpu_total_compute can be overwritten by configuration.

In helper/stat/cpu.go the Init uses github.com/shirou/gopsutil/v3/cpu to detect CPU MHz. This fails on ARM64, and as such, cpuTotalTicks is set to zero.

In client/fingerprint/cpu.go the config request.Config.CpuCompute is checked, and if set, cpu_total_computer is set with this value.

Now everywhere else in the code to calculate CPU ticks, TotalTicksAvailable is used. But this cannot be changed via the above mentioned configuration. So although capacity etc is all calculated with cpu_total_compute, CPU usage is not. And this feels like an oversight, as those two are based on the same value: the total available MHz.

So, as an experiment I made a very ugly patch, that sets TotalTicksAvailable to return cpu_total_compute. And that actually just works fine. I can see CPU usage in the UI, and Prometheus starts to report it too.

I wanted to write a proper Pull Request to address this issue, but ... I just don't know how. I think the proper way would be to give helper/stat/cpu.go access to the configuration, so when cpu_total_compute is set, it uses that, instead of auto-detection. This would also make client/fingerprint/cpu.go a bit simpler .. basically, it would move the config setting upstream. But I just didn't know how to do that :)

Anyway, I hope this makes sense, and that you would be willing to help me out getting this fixed. Either by helping me point out how to actually do this, or by possibly leaning your head on the keyboard and doing this for me :D ;) Tnx either way!

PS: another odd thing about all this, is that CPU ticks is a gauge .. I really expected that to be a counter, so you don't have to constantly monitor the value .. if it would be a counter, I could just request the value once every 5 minutes, and still calculate the average CPU usage. As it is now a gauge, it is just another way to represent the percentage .. I find this a bit odd. But not relevant to the problem at hand :)

PPS: how I solved it is like this. But it is very dirty, and just meant to proof it actually addresses the issue.

diff --git a/client/fingerprint/cpu.go b/client/fingerprint/cpu.go
index c6c69e3b1f..64394f8f59 100644
--- a/client/fingerprint/cpu.go
+++ b/client/fingerprint/cpu.go
@@ -141,6 +141,7 @@ func (f *CPUFingerprint) setTotalCompute(request *FingerprintRequest, response *
        default:
                ticks = defaultCPUTicks
        }
+       stats.SetTotalTicksAvailable(ticks)
        response.AddAttribute("cpu.totalcompute", fmt.Sprintf("%d", ticks))
        f.resources.CPU = int(ticks)
        f.nodeResources.Cpu.CpuShares = int64(ticks)
diff --git a/helper/stats/cpu.go b/helper/stats/cpu.go
index 4ddb6f517a..b1755d1f8d 100644
--- a/helper/stats/cpu.go
+++ b/helper/stats/cpu.go
@@ -106,6 +106,10 @@ func CPUModelName() string {
        return cpuModelName
 }

+func SetTotalTicksAvailable(new uint64) {
+       cpuTotalTicks = new
+}
+
 // TotalTicksAvailable calculates the total MHz available across all cores.
 //
 // Where asymetric cores are correctly detected, the total ticks is the sum of
tgross commented 1 year ago

Hi @TrueBrain I'm not sure I understand how that would work. The order of operations you're proposing is:

I'd think if we wanted to take this approach we'd want to pass in the configured cpuTotalTicks as a new parameter to stats.Init()

TrueBrain commented 1 year ago

I see I wrote too much text; writing down your mind can be useful, but can also be misleading. Sorry about that :)

And no, I am not proposing that; my diff is utter shit, and shouldn't be merged :) It violates layering by just poking holes .. it is bad. But I don't know Go enough to make it proper :) It does proof it actually fixes the issue .. so there is that :) But let's forget about that diff :D

If with your last sentence you mean the config setting cpu_total_compute, we think the same thing, yes. I just have no clue how to go about that .. request.Config contains the config, but doesn't seem available when stats.Init is called. If you can point me in the right direction how to get it available, I have no problem making a PR out of this :)

TrueBrain commented 1 year ago

Owh, now I wrote this, I see I am a peanut .. it is just not passed to initialize. Let me create a PR :)