intel / powertelemetry

Internal sources of Power Telemetry Library. Power Telemetry Library is a golang library that provides power-related CPU info.
Apache License 2.0

intel_powerstat not working if deployed within a container #6

Open divStar opened 6 months ago

divStar commented 6 months ago

Hello,

TL;DR:

intel_powerstat does not work if Telegraf is deployed in a container. I am referring to this ticket https://github.com/influxdata/telegraf/issues/14881.


Relevant telegraf.conf

```toml
[[inputs.intel_powerstat]]
  interval = "10s"

  ## The user can choose which package metrics are monitored by the plugin with
  ## the package_metrics setting:
  ## - The default, will collect "current_power_consumption",
  ## "current_dram_power_consumption" and "thermal_design_power"
  ## - Leaving this setting empty means no package metrics will be collected
  ## - Finally, a user can specify individual metrics to capture from the
  ## supported options list
  ## Supported options:
  ## "current_power_consumption", "current_dram_power_consumption",
  ## "thermal_design_power", "max_turbo_frequency", "uncore_frequency",
  ## "cpu_base_frequency"
  package_metrics = ["current_power_consumption", "current_dram_power_consumption"]

  ## The user can choose which per-CPU metrics are monitored by the plugin in
  ## cpu_metrics array.
  ## Empty or missing array means no per-CPU specific metrics will be collected
  ## by the plugin.
  ## Supported options:
  ## "cpu_frequency", "cpu_c0_state_residency", "cpu_c1_state_residency",
  ## "cpu_c6_state_residency", "cpu_busy_cycles", "cpu_temperature",
  ## "cpu_busy_frequency"
  ## ATTENTION: cpu_busy_cycles is DEPRECATED - use cpu_c0_state_residency
  cpu_metrics = ["cpu_frequency", "cpu_c0_state_residency", "cpu_c1_state_residency","cpu_c6_state_residency", "cpu_busy_frequency"]
```

Logs from Telegraf

```text
2024-02-22T15:53:22Z I! Loading config: /etc/telegraf/telegraf.conf
2024-02-22T15:53:22Z I! Starting Telegraf 1.29.5 brought to you by InfluxData the makers of InfluxDB
2024-02-22T15:53:22Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-02-22T15:53:22Z I! Loaded inputs: intel_powerstat mqtt_consumer
2024-02-22T15:53:22Z I! Loaded aggregators:
2024-02-22T15:53:22Z I! Loaded processors:
2024-02-22T15:53:22Z I! Loaded secretstores:
2024-02-22T15:53:22Z I! Loaded outputs: influxdb_v2
2024-02-22T15:53:22Z I! Tags enabled: host=233d22a54fc0
2024-02-22T15:53:22Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"233d22a54fc0", Flush Interval:10s
2024-02-22T15:53:52Z I! [inputs.mqtt_consumer] Connected [tcp://mqtt.my.family:1883]
2024-02-22T15:53:52Z W! [inputs.intel_powerstat] Plugin started with errors: PowerTelemetry instance initialized with errors: failed to initialize msr: invalid MSR base path "/dev/cpu": file "/dev/cpu" does not exist; failed to initialize rapl: invalid base path of rapl control zone: file "/sys/devices/virtual/powercap/intel-rapl" does not exist
2024-02-22T15:54:00Z E! [inputs.intel_powerstat] Error in plugin: failed to update MSR time-related metrics: module "msr" is not initialized
2024-02-22T15:54:00Z E! [inputs.intel_powerstat] Error in plugin: failed to get "current_power_consumption": module "rapl" is not initialized
2024-02-22T15:54:00Z E! [inputs.intel_powerstat] Error in plugin: failed to get "current_dram_power_consumption": module "rapl" is not initialized
```

System info

Ubuntu 22.04.03, Telegraf 1.29.5, Docker (Server Version) 25.0.3

Docker

docker-compose.yml

```yaml
version: '3'
services:
  telegraf:
    image: telegraf:latest
    container_name: telegraf
    restart: unless-stopped
    environment:
      INFLUX_TOKEN: ""
      HOST_ETC: "/hostfs/etc"
      HOST_PROC: "/hostfs/proc"
      HOST_SYS: "/hostfs/sys"
      HOST_VAR: "/hostfs/var"
      HOST_RUN: "/hostfs/run"
      HOST_MOUNT_PREFIX: "/hostfs"
    volumes:
      - '/telegraf.conf:/etc/telegraf/telegraf.conf'
      - '/:/hostfs:ro'
    # depends_on:
    #   - influxdb
    networks:
      - services-network
networks:
  services-network:
    external: true
```

Steps to reproduce

  1. Ensure your system supports Intel MSR and/or RAPL and that the appropriate kernel modules have been loaded (e.g. using lsmod | grep rapl).
  2. Ensure your system has cpuid installed (sudo apt-get install -y cpuid).
  3. Set up a network (in my example it's called services-network and is a bridge-type network).
  4. Create a docker-compose.yaml with just Telegraf, as shown in the Docker section above.
  5. Configure it to use inputs.intel_powerstat.
  6. Run the docker-compose.yaml file.
  7. Wait about 20 seconds.

Expected behavior

I expect the plugin to look for the PowerTelemetry sources inside /hostfs/sys/... or /hostfs/dev/... etc., to not throw any errors, and ultimately to collect the corresponding values.

Actual behavior

As the warning `2024-02-22T15:53:52Z W! [inputs.intel_powerstat] Plugin started with errors: PowerTelemetry instance initialized with errors: failed to initialize msr: invalid MSR base path "/dev/cpu": file "/dev/cpu" does not exist; failed to initialize rapl: invalid base path of rapl control zone: file "/sys/devices/virtual/powercap/intel-rapl" does not exist` states, the plugin does not find the corresponding directories. `/hostfs/dev/cpu` and `/hostfs/sys/devices/virtual/powercap/intel-rapl` do exist inside the container, but the plugin does not seem to look there.
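To confirm which of these directories are visible from inside the container, a small diagnostic like the following might help. It is only a sketch: `existingPaths` is a hypothetical helper written for this report, not part of powertelemetry or Telegraf, and the `/hostfs` variants assume the mount layout from the docker-compose.yml above.

```go
package main

import (
	"fmt"
	"os"
)

// existingPaths returns the subset of paths that os.Stat can resolve.
// Hypothetical helper for debugging; not part of powertelemetry.
func existingPaths(paths []string) []string {
	var found []string
	for _, p := range paths {
		if _, err := os.Stat(p); err == nil {
			found = append(found, p)
		}
	}
	return found
}

func main() {
	// The first two paths are the ones named in the warning; the last two
	// assume the host root is mounted at /hostfs as in the compose file.
	candidates := []string{
		"/dev/cpu",
		"/sys/devices/virtual/powercap/intel-rapl",
		"/hostfs/dev/cpu",
		"/hostfs/sys/devices/virtual/powercap/intel-rapl",
	}
	for _, p := range existingPaths(candidates) {
		fmt.Println("exists:", p)
	}
}
```

If only the `/hostfs/...` entries are printed, the data is present in the container but not at the paths the plugin checks.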

Additional info

I've checked out the project and tried looking around, but I cannot find where (if at all) HOST_MOUNT_PREFIX or any of the HOST_* environment variables would be used. They are used to some extent in other plugins it seems, but not in this one.
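To illustrate the behavior I was expecting, here is a minimal sketch of how a base path could be resolved against those variables. Everything in it is an assumption: `resolveHostPath` is a hypothetical function, and treating `HOST_SYS` as a replacement for `/sys` and `HOST_MOUNT_PREFIX` as a generic fallback prefix is my reading of the convention, not documented behavior of this library or of Telegraf.

```go
package main

import (
	"fmt"
	"os"
	"path"
	"strings"
)

// resolveHostPath maps a default host path to its location inside the
// container. Hypothetical helper; the precedence of the variables is an
// assumption made for this sketch.
func resolveHostPath(p string, getenv func(string) string) string {
	// Assumed: HOST_SYS replaces the leading "/sys" component.
	if sys := getenv("HOST_SYS"); sys != "" && (p == "/sys" || strings.HasPrefix(p, "/sys/")) {
		return path.Join(sys, strings.TrimPrefix(p, "/sys"))
	}
	// Assumed: HOST_MOUNT_PREFIX is prepended to any other path.
	if prefix := getenv("HOST_MOUNT_PREFIX"); prefix != "" {
		return path.Join(prefix, p)
	}
	// No variables set: keep the default path.
	return p
}

func main() {
	for _, p := range []string{"/dev/cpu", "/sys/devices/virtual/powercap/intel-rapl"} {
		fmt.Println(p, "->", resolveHostPath(p, os.Getenv))
	}
}
```

With the environment from the compose file above, this would map the two default base paths to `/hostfs/dev/cpu` and `/hostfs/sys/devices/virtual/powercap/intel-rapl`.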

Edit: I also found the following: when installing Telegraf locally, even though MSR and RAPL were available, I had to do a couple of things before the plugin worked, namely:

sudo chmod -R a+r /sys/devices/virtual/powercap/
sudo setcap cap_sys_rawio=ep /usr/bin/telegraf
sudo systemctl restart telegraf

After that, Telegraf started working locally and sending values to my InfluxDB in the container as I'd expect it to.
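The effect of those two commands can be checked with a short program that tries to open the relevant files as the Telegraf user. This is a sketch under assumptions: `accessStatus` is a hypothetical helper, and the two file names (`intel-rapl:0/energy_uj` and `/dev/cpu/0/msr`) are the typical locations on a Linux host with RAPL and the msr module loaded, so they may differ on other systems.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// accessStatus reports whether path can be opened for reading.
// Hypothetical helper for debugging; not part of powertelemetry.
func accessStatus(path string) string {
	f, err := os.Open(path)
	switch {
	case err == nil:
		f.Close()
		return "readable"
	case errors.Is(err, fs.ErrNotExist):
		return "missing"
	case errors.Is(err, fs.ErrPermission):
		// Covers both EACCES (file mode) and EPERM (missing capability).
		return "permission denied"
	default:
		return "error: " + err.Error()
	}
}

func main() {
	// Assumed typical paths; adjust to the package and CPU IDs on your host.
	for _, p := range []string{
		"/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/energy_uj",
		"/dev/cpu/0/msr",
	} {
		fmt.Printf("%s: %s\n", p, accessStatus(p))
	}
}
```

A "permission denied" result on the host points to the chmod or setcap step, whereas "missing" inside the container points to the path problem described above.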

In the containerized Telegraf instance, though, I could not get it to work, even when mounting the host's /sys and /dev directly (instead of /hostfs/sys and /hostfs/dev) and even with privileged: true and user: "0:0".