Closed TrueBrain closed 1 year ago
Hi @TrueBrain I'm not sure I understand how that would work. The order of operations you're proposing is:
stats.Init()
, which sets the cpuTotalTicks
and a whole mess of other globals in that package.setTotalCompute
, which reads stats.TotalTicksAvailable()
, which in turn is the only reader of cpuTotalTicks
stats.SetTotalTicksAvailable
, which writes to cpuTotalTicks
.I'd think if we wanted to take this approach we'd want to pass in the configured cpuTotalTicks
as a new parameter to stats.Init()
I see I wrote too much text; writing down your mind can be useful, but can also be misleading. Sorry about that :)
And no, I am not proposing that; my diff is utter shit, and shouldn't be merged :) It violates layering by just poking holes .. it is bad. But I don't know Go enough to make it proper :) It does proof it actually fixes the issue .. so there is that :) But let's forget about that diff :D
If with your last sentence you mean the config setting cpu_total_compute
, we think the same thing, yes. I just have no clue how to go about that .. request.Config
contains the config, but doesn't seem available when stats.Init
is called. If you can point me in the right direction how to get it available, I have no problem making a PR out of this :)
Owh, now I wrote this, I see I am a peanut .. it is just not passed to initialize
. Let me create a PR :)
Proposal
There are many tickets on this issue-tracker that explains that the
cpu_total_compute
doesn't work on ARM64. This ticket is not about that. It is however about the fact that, even if you fill in that value manually, or run on AWS where it is done for you, jobs and clients still show 0% CPU usage. And I think this is just an oversight, and can be fixed.Use-cases
Not having CPU overview in the UI, and not receiving
ticks
via Prometheus, is just slightly annoying. Although thepercentage
works fine on Prometheus, and they are basically the same value, it is just nice to have ARM64 on-par with other platforms.Attempted Solutions
In short, the problem seems to be in at what place
cpu_total_compute
can be overwritten by configuration.In
helper/stat/cpu.go
theInit
usesgithub.com/shirou/gopsutil/v3/cpu
to detect CPU MHz. This fails on ARM64, and as such,cpuTotalTicks
is set to zero.In
client/fingerprint/cpu.go
the configrequest.Config.CpuCompute
is checked, and if set,cpu_total_computer
is set with this value.Now everywhere else in the code to calculate CPU ticks,
TotalTicksAvailable
is used. But this cannot be changed via the above mentioned configuration. So although capacity etc is all calculated withcpu_total_compute
, CPU usage is not. And this feels like an oversight, as those two are based on the same value: the total available MHz.So, as an experiment I made a very ugly patch, that sets
TotalTicksAvailable
to returncpu_total_compute
. And that actually just works fine. I can see CPU usage in the UI, and Prometheus starts to report it too.I wanted to write a proper Pull Request to address this issue, but ... I just don't know how. I think the proper way would be to give
helper/stat/cpu.go
access to the configuration, so whencpu_total_compute
is set, it uses that, instead of auto-detection. This would also makeclient/fingerprint/cpu.go
a bit simpler .. basically, it would move the config setting upstream. But I just didn't know how to do that :)Anyway, I hope this makes sense, and that you would be willing to help me out getting this fixed. Either by helping me point out how to actually do this, or by possibly leaning your head on the keyboard and doing this for me :D ;) Tnx either way!
PS: another odd thing about all this, is that CPU ticks is a gauge .. I really expected that to be a counter, so you don't have to constantly monitor the value .. if it would be a counter, I could just request the value once every 5 minutes, and still calculate the average CPU usage. As it is now a gauge, it is just another way to represent the percentage .. I find this a bit odd. But not relevant to the problem at hand :)
PPS: how I solved it is like this. But it is very dirty, and just meant to proof it actually addresses the issue.