open-telemetry / opentelemetry-python-contrib

OpenTelemetry instrumentation for Python modules
https://opentelemetry.io
Apache License 2.0
698 stars 579 forks source link

`system.cpu.utilization` values are always 0 in first export from `PeriodicExportingMetricReader` #2797

Open alexmojaki opened 1 month ago

alexmojaki commented 1 month ago

What happened?

Combining PeriodicExportingMetricReader and SystemMetricsInstrumentor produces only zero values for system.cpu.utilization in the first export, which may be the only export if the process doesn't last long.

Steps to Reproduce

from opentelemetry.instrumentation.system_metrics import SystemMetricsInstrumentor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
import time

exporter = ConsoleMetricExporter()
reader = PeriodicExportingMetricReader(exporter)
meter_provider = MeterProvider(metric_readers=[reader])
config = {"system.cpu.utilization": ["idle", "user", "system"]}
instrumentor = SystemMetricsInstrumentor(config=config)
instrumentor.instrument(meter_provider=meter_provider)

# Wait a bit to get realistic measurements
time.sleep(1)

Output:

{
  "resource_metrics": [
    {
      "resource": {
        "attributes": {
          "telemetry.sdk.language": "python",
          "telemetry.sdk.name": "opentelemetry",
          "telemetry.sdk.version": "1.26.0.dev0",
          "service.name": "unknown_service"
        },
        "schema_url": ""
      },
      "scope_metrics": [
        {
          "scope": {
            "name": "opentelemetry.instrumentation.system_metrics",
            "version": "0.48b0.dev",
            "schema_url": "https://opentelemetry.io/schemas/1.11.0"
          },
          "metrics": [
            {
              "name": "system.cpu.utilization",
              "description": "System CPU utilization",
              "unit": "1",
              "data": {
                "data_points": [
                  {
                    "attributes": {
                      "state": "idle",
                      "cpu": 1
                    },
                    "start_time_unix_nano": 0,
                    "time_unix_nano": 1723642134243694000,
                    "value": 0.0
                  },
                  {
                    "attributes": {
                      "state": "user",
                      "cpu": 1
                    },
                    "start_time_unix_nano": 0,
                    "time_unix_nano": 1723642134243694000,
                    "value": 0.0
                  },
                  {
                    "attributes": {
                      "state": "system",
                      "cpu": 1
                    },
                    "start_time_unix_nano": 0,
                    "time_unix_nano": 1723642134243694000,
                    "value": 0.0
                  },
                  {
                    "attributes": {
                      "state": "idle",
                      "cpu": 2
                    },
                    "start_time_unix_nano": 0,
                    "time_unix_nano": 1723642134243694000,
                    "value": 0.0
                  },
                  {
                    "attributes": {
                      "state": "user",
                      "cpu": 2
                    },
                    "start_time_unix_nano": 0,
                    "time_unix_nano": 1723642134243694000,
                    "value": 0.0
                  },
                  {
                    "attributes": {
                      "state": "system",
                      "cpu": 2
                    },
                    "start_time_unix_nano": 0,
                    "time_unix_nano": 1723642134243694000,
                    "value": 0.0
                  },
                  "... etc for every core ..."
                ]
              }
            }
          ],
          "schema_url": "https://opentelemetry.io/schemas/1.11.0"
        }
      ],
      "schema_url": ""
    }
  ]
}

Expected Result

Some values greater than 0.

Actual Result

"value": 0.0 for every data point.

Additional context

In https://psutil.readthedocs.io/en/latest/#psutil.cpu_times_percent it says:

Warning: the first time this function is called with interval = 0.0 or None it will return a meaningless 0.0 value which you are supposed to ignore.

which is where the problem comes from. However, the first time a user calls cpu_times_percent in the main thread, it actually returns a nonzero value. This is because psutil has this at the top level of the module:

_last_cpu_times = {threading.current_thread().ident: cpu_times()}

This means that when metrics are read on the main thread (e.g. by adding meter_provider.force_flush() to the script above, or by using InMemoryMetricReader) then the CPU values are nonzero.

The problem is that PeriodicExportingMetricReader naturally has to perform readings on a separate thread. From the same psutil docs:

Internally this function maintains a global map (a dict) where each key is the ID of the calling thread (threading.get_ident). This means it can be called from different threads, at different intervals, and still return meaningful and independent results.

Adding the following code to the script above anywhere between reader = ... and time.sleep seems to fix the problem:

import psutil

cpu_times = psutil.cpu_times_percent(percpu=True)
psutil._last_per_cpu_times_2[reader._daemon_thread.ident] = cpu_times

Actually implementing this fix in a sensible and maintainable way seems more complicated.

Would you like to implement a fix?

No

alexmojaki commented 3 weeks ago

The same problem exists for process.runtime.cpu.utilization, but it's more easily fixable:

import time

from opentelemetry.instrumentation.system_metrics import SystemMetricsInstrumentor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

exporter = ConsoleMetricExporter()
reader = PeriodicExportingMetricReader(exporter)
meter_provider = MeterProvider(metric_readers=[reader])
config = {'process.runtime.cpu.utilization': None}
instrumentor = SystemMetricsInstrumentor(config=config)
instrumentor.instrument(meter_provider=meter_provider)

# Uncomment to get nonzero values
# instrumentor._proc.cpu_percent()

# Use some CPU
sum(list(range(int(1e7))))

# Wait a bit to get realistic measurements
time.sleep(1)

Adding self._proc.cpu_percent() to the place where the instrument is created should fix it, because these values aren't tracked per thread. Maybe that'll change one day.