nod-ai / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0
1 stars 0 forks source link

parse metrics failed: collecting AMD GPU metrics #11

Closed smedegaard closed 1 week ago

smedegaard commented 2 weeks ago

When running TorchServe, before printing out AMD GPU metrics, this message appears:

[WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - Parse metrics failed: Collecting AMD GPU metrics... A small semantic thing, but could this be fixed in a way that it wouldn’t say “Parse metrics failed” at this point, and just proceeds to printing the AMD GPU metrics?

How to Reproduce

torchserve --start --model-store model_store --models densenet161=densenet161.mar --disable-token-auth --enable-model-api

jakki-amd commented 1 week ago

Cause might be because metric manager tries to get Nvidia metrics first, but fails to do so with AMD HW.