hashicorp / nomad-driver-podman

A nomad task driver plugin for sandboxing workloads in podman containers
https://developer.hashicorp.com/nomad/plugins/drivers/podman
Mozilla Public License 2.0

CPU utilization not reported on aarch64 #217

Closed: courtland closed this 7 months ago

courtland commented 1 year ago

Under AWS Graviton aarch64/arm64 instances, the CPU utilization reported by the nomad client is 0.

My understanding is that this is a function of the driver (podman, in my case) and not the Nomad client. I briefly tried to track down where this could be breaking, and the `// FIXME implement cpu stats correctly` in `runStatsEmitter` made me wonder if that's somehow related.

`podman stats` shows container usage correctly.

Ubuntu 22.04.2 LTS // podman version 3.4.4

This is somewhat related to https://github.com/hashicorp/nomad/issues/4233

towe75 commented 1 year ago

Hey @courtland, thank you for reaching out. Just to verify: does podman (without Nomad) show the stats?

`podman stats`

lgfa29 commented 1 year ago

Hi @courtland 👋

Could you also check if you're running cgroups v2 and if switching to v1 fixes the problem? This may be related to https://github.com/hashicorp/nomad-driver-podman/issues/160.
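For anyone else following along, a quick way to tell which cgroup version a host is running is to inspect the filesystem type mounted at `/sys/fs/cgroup` (a minimal sketch, assuming a systemd-based Linux host):

```shell
# cgroup2fs => cgroups v2 (unified hierarchy)
# tmpfs     => legacy cgroups v1
stat -fc %T /sys/fs/cgroup/
```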

courtland commented 1 year ago

Yes, `podman stats` correctly shows CPU usage.

I am indeed running cgroups v2 (Ubuntu jammy/22.04), so that may well be the culprit. I will work on testing an instance with v1.
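For reference, one way to boot an Ubuntu 22.04 host with cgroups v1 is to disable systemd's unified hierarchy via a kernel parameter. This is a hedged sketch, not a tested procedure; the `sed` pattern assumes the stock `GRUB_CMDLINE_LINUX="..."` line in `/etc/default/grub`, so double-check the file before rebooting:

```shell
# Prepend systemd.unified_cgroup_hierarchy=0 to the kernel command line,
# regenerate the GRUB config, and reboot into the v1 hierarchy.
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&systemd.unified_cgroup_hierarchy=0 /' /etc/default/grub
sudo update-grub
sudo reboot
```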

courtland commented 1 year ago

Switching to cgroups v1 does NOT fix the problem. It's also worth noting that client CPU utilization is reported correctly on my x86_64/amd64 instances (everything else is the same except the architecture).

lgfa29 commented 1 year ago

Thanks for testing. It seems we will need to investigate this further 👍

Procsiab commented 9 months ago

Hello there, I was interested in helping investigate this issue since I am also using Nomad to orchestrate container workloads on ARM 64-bit nodes. However, I am not able to reproduce the absence of CPU statistics, so maybe I have done something different in my setup compared to @courtland's, perhaps the Podman version or the permissions on the cgroup slice folder. I'll attach below some screenshots from Nomad's UI for an aarch64 node I am using (a Raspberry Pi 3B).

(Two screenshots from 2023-10-04 of the Nomad UI, showing CPU utilization reported for the aarch64 node.)

lgfa29 commented 7 months ago

Thank you for the extra info @Procsiab.

CPU fingerprinting and stats are something we've been fixing in Nomad (especially on ARM) for the past few releases, so we may have fixed this at some point. Which version of Nomad are you using?

@courtland by any chance would you be able to check if this is still a problem?

Thanks!

Procsiab commented 7 months ago

In reply to @lgfa29, and to add to my previous post: at the time of writing it, I was using Nomad 1.6.2. I am now running Podman 4.7.2 and Nomad 1.6.3 on the same ARM hardware and am still not experiencing the issue discussed here.

courtland commented 7 months ago

> Thank you for the extra info @Procsiab.
>
> CPU fingerprinting and stats are something we've been fixing in Nomad (especially on ARM) for the past few releases, so we may have fixed this at some point. Which version of Nomad are you using?
>
> @courtland by any chance would you be able to check if this is still a problem?
>
> Thanks!

Yes, Nomad 1.6.x seems to have resolved this problem. Closing this issue. Thanks for following up and looking into it, @Procsiab.