open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[receiver/hostmetricsreceiver] Gopsutil error on windows with multiple processor groups. #33340

Open alvarocabanas opened 5 months ago

alvarocabanas commented 5 months ago

Component(s)

receiver/hostmetricsreceiver

What happened?

Description

On a Windows instance with 48 * 2 logical CPUs, Windows splits the CPUs into processor groups of at most 64 logical CPUs each. gopsutil's cpu.TimesWithContext, which is called from the cpuscraper and is also used by the resource detector to calculate CPU times and utilization, returns data from only one of the two processor groups, seemingly at random, instead of from all of them.

As a result, the reported CPU utilization fluctuates: it is sometimes negative and sometimes shows full usage even when only half the cores are busy.
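To illustrate why this produces negative or inflated values, here is a minimal sketch in Go. It assumes utilization is derived from the difference between two cumulative samples, which is a simplification and not the collector's actual code. The cpuTimes type, the utilization helper, and all counter values are hypothetical. When the previous and current samples for the "same" CPU index come from different processor groups, the deltas are meaningless:

```go
package main

import "fmt"

// cpuTimes is a hypothetical, simplified cumulative sample (in seconds)
// for one logical CPU. It is not gopsutil's TimesStat.
type cpuTimes struct {
	Busy float64
	Idle float64
}

// utilization returns the busy fraction of the interval between two
// cumulative samples. It assumes both samples describe the same CPU.
func utilization(prev, curr cpuTimes) float64 {
	busy := curr.Busy - prev.Busy
	total := busy + (curr.Idle - prev.Idle)
	if total == 0 {
		return 0
	}
	return busy / total
}

func main() {
	// Made-up counters for CPU 0 of two processor groups at two scrapes.
	groupA0 := cpuTimes{Busy: 1000, Idle: 9000}
	groupA1 := cpuTimes{Busy: 1010, Idle: 9010}
	groupB0 := cpuTimes{Busy: 400, Idle: 9600}
	groupB1 := cpuTimes{Busy: 405, Idle: 9615}

	// Both samples from the same group: a plausible value between 0 and 1.
	fmt.Printf("A -> A: %.2f\n", utilization(groupA0, groupA1))

	// Previous sample from group A, current from group B: the busy
	// counter appears to go backwards, so the result is negative.
	fmt.Printf("A -> B: %.2f\n", utilization(groupA0, groupB1))

	// Previous sample from group B, current from group A: the busy
	// delta exceeds the interval total, so the result is above 1.
	fmt.Printf("B -> A: %.2f\n", utilization(groupB0, groupA1))
}
```

This matches the symptom described above: the values depend on which group each scrape happened to read, not on the actual load.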

This bug has been reported in the gopsutil library.

Steps to Reproduce

In our case, we reproduced it on an 'm5n.metal' machine in AWS running windows-server-22, but we know of reports of it happening on other Windows machines with more than one processor group.

Expected Result

Correct CPU usage and times.

Actual Result

CPU data points come from only one of the 2 processor groups, chosen at random.

Collector version

v0.101.0

Environment information

Environment

OS: Windows-server 22

OpenTelemetry Collector configuration

receivers:
  hostmetrics:
    collection_interval: 20s
    scrapers:
      cpu:
        metrics:
          system.cpu.time:
            enabled: true
          system.cpu.utilization:
            enabled: true
processors:
  # group system.cpu metrics by cpu
  metricstransform:
    transforms:
      - include: system.cpu.utilization
        action: update
        operations:
          - action: aggregate_labels
            label_set: [ state ]
            aggregation_type: mean
[ ... ]

Log output

No response

Additional context

No response

github-actions[bot] commented 5 months ago

Pinging code owners for receiver/hostmetrics: @dmitryax @braydonk. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 2 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

atoulme commented 1 month ago

Unfortunately, the fix seems to reside in gopsutil.