open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Support for GPU (CUDA) metrics #392

Closed ymotongpoo closed 1 year ago

ymotongpoo commented 4 years ago

Is your feature request related to a problem? Please describe.
GPU monitoring is often neglected and tends to require custom hacks or dedicated tools. Having an OT integration with CUDA would help GPU users.

Describe the solution you'd like
Add a receiver that integrates with CUDA (NVML) and fetches the metrics available from nvidia-smi.

Additional context
I researched how GPU users monitor their GPUs and found that they seldom use a monitoring backend for it because of the setup cost. It would be great if OT helped those users get started with a better monitoring experience.

ymotongpoo commented 4 years ago

Let us discuss the suggestion, made in this pull request (https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/393), to have a dedicated repository for a GPU metrics extractor binary.

In this comment (https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/393#issuecomment-662777421), @tigrannajaryan suggested resolving this issue by creating an independent binary that extracts GPU metrics and emits them in Prometheus format, so that the Collector can consume the data without changes (using the Prometheus receiver). The suggestion was motivated by the CGO dependency.
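To make the idea concrete, here is a minimal sketch of the output side of such a standalone binary: it renders per-GPU samples in the Prometheus text exposition format. The metric names (`gpu_utilization_ratio`, `gpu_memory_used_bytes`), the sample keys, and the function name are hypothetical, chosen only for illustration; nothing here comes from the pull request.

```python
from typing import Dict, List


def render_prometheus(samples: List[Dict[str, float]]) -> str:
    """Render per-GPU samples in the Prometheus text exposition format.

    Metric names and sample keys are hypothetical, chosen for this sketch.
    """
    # metric name -> (help text, key to read from each sample)
    metrics = {
        "gpu_utilization_ratio": ("GPU utilization as a fraction of 1.", "utilization"),
        "gpu_memory_used_bytes": ("GPU memory in use, in bytes.", "memory_used_bytes"),
    }
    lines = []
    for name, (help_text, key) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for sample in samples:
            # One time series per GPU, distinguished by the "gpu" label.
            lines.append(f'{name}{{gpu="{int(sample["index"])}"}} {sample[key]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_prometheus([{"index": 0, "utilization": 0.5, "memory_used_bytes": 1048576}]))
```

A binary serving this text over HTTP could presumably be scraped by the Collector's Prometheus receiver with no Collector-side changes, which is the point of the suggestion.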

As I commented in my reply (https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/393#issuecomment-664685070), I think OTel can set up a repository for the GPU metrics binary for the following reasons:

  1. NVIDIA GPUs are the de facto standard on the major public cloud platforms (e.g. GCP, AWS, Azure, Oracle, IBM)
  2. Someone needs to write a GPU metrics collection binary to integrate with the OpenTelemetry Collector anyway
  3. The feature closes the gap between GPU users and existing OpenTelemetry projects

Though the binary itself is not defined in the OTel spec, its objective is to fill the gap between GPU instance monitoring and OTel. The binary complements OTel, so I think it makes sense to have a dedicated repository for it.

@mtwo @james-bebbington and @anuraaga support the idea. @tigrannajaryan @jrcamp @bogdandrutu Can you suggest the process for proposing a SIG, Working Group, etc. for this? I couldn't find one in the community README. https://github.com/open-telemetry/community

SergeyKanzhelev commented 4 years ago

@ymotongpoo do we need a separate SIG/WG or simply a repo to host the code? Can one of the existing SIGs own it?

ymotongpoo commented 4 years ago

Thank you @SergeyKanzhelev for the reaction, and I appreciate your support! Because this is a complementary feature for the Collector, a WG under the Collector SIG makes sense to me. Otherwise, a simple repo would work too, of course.

dashpole commented 3 years ago

Would something like NVidia's dcgm exporter work? https://github.com/NVIDIA/gpu-monitoring-tools. It can attach pod info as well.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

elgalu commented 1 year ago

@open-telemetry/collector-contrib-triagers this is still relevant. Currently the OpenTelemetry Collector does not support GPU-related metrics; however, Sentry's OpenTelemetry Collector does collect GPU metrics. Unfortunately, it seems this Sentry piece is closed source.

fatsheep9146 commented 1 year ago

@open-telemetry/collector-contrib-triagers this is still relevant, currently OpenTelemetry Collector does not support GPU related metrics however Sentry's OpenTelemetry Collector does collect GPU metrics, since Sentry is open source, is there a way to leverage on their work?

Sorry, I cannot find the open-source code in Sentry that supports GPU metrics. Do you know where to find it? @elgalu

elgalu commented 1 year ago

Sorry, that was the wrong link. It should be https://www.sentrysoftware.com/docs/hws-otel-collector/latest/metrics.html, but I also cannot find the open-source code. Perhaps it is closed source then :(

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.

MovieStoreGuy commented 1 year ago

I've briefly looked at this since I was interested in it myself. The issue I can see is that we would need the user to either run with CGO enabled (meaning the resulting binaries won't be portable, since they will require additional libs to be installed) or rely on calling the underlying command (like nvidia-smi). Some receivers already do the latter, but it also makes portability hard.
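As a rough illustration of the second option (shelling out to the command rather than linking NVML through CGO), here is a minimal sketch that runs nvidia-smi with its `--query-gpu` and `--format=csv,noheader,nounits` options and parses the result. It assumes nvidia-smi is on the PATH; the function names and the chosen field list are mine for illustration and are not taken from any existing receiver.

```python
import csv
import io
import subprocess
from typing import Dict, List, Optional

# Fields requested via nvidia-smi's --query-gpu option (an illustrative subset).
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def _to_float(value: str) -> Optional[float]:
    """Convert a CSV cell to float; non-numeric cells such as "[N/A]" become None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_nvidia_smi_csv(text: str) -> List[Dict[str, Optional[float]]]:
    """Parse csv,noheader,nounits output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:  # skip blank lines
            continue
        rows.append({field: _to_float(cell) for field, cell in zip(QUERY_FIELDS, record)})
    return rows


def query_gpus() -> List[Dict[str, Optional[float]]]:
    """Run nvidia-smi (assumed to be on PATH) and return the parsed samples."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=10)
    return parse_nvidia_smi_csv(result.stdout)
```

This avoids CGO entirely, but it shows the portability cost mentioned above: the host must have the NVIDIA driver tools installed, and the parser depends on the command's output format.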

lfpalacios commented 2 months ago

I would like to monitor GPU metrics on EKS with OpenTelemetry. Can we reopen this? Is the following related? https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#hwgpu---gpu-metrics