nod-ai / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Check the state of temporary solution when collecting gpu metrics #17

Open eppane opened 2 weeks ago

eppane commented 2 weeks ago

We aim to use the torch.cuda interface in ts.metrics.system_metrics.collect_gpu_metrics() for amdsmi-related calls, but a bug in torch.cuda currently prevents this.

A fix has been merged upstream but has not yet been released:

https://github.com/pytorch/pytorch/pull/140259

UPDATE: We may have to wait until 29.1.2025, when the release of PyTorch 2.6.0 is scheduled; see PyTorch Release 2.6.0 | Call for features.
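
Until then, one possible way to keep the temporary solution in place and switch over automatically is to gate on the installed PyTorch version. The following is only a rough sketch, not the actual code in collect_gpu_metrics(): the helper names (`parse_release`, `can_use_torch_cuda`, `installed_torch_version`) are hypothetical, and it assumes the fix from the PR above first ships in 2.6.0, which is not yet confirmed.

```python
import re

# Assumption: the fix from pytorch/pytorch#140259 first ships in 2.6.0.
ASSUMED_FIX_VERSION = (2, 6, 0)


def parse_release(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '2.6.0+rocm6.2' -> (2, 6, 0).

    Local and pre-release suffixes are ignored, so a nightly such as
    '2.6.0.dev20241112' is treated like 2.6.0 (a simplification).
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def can_use_torch_cuda(version: str) -> bool:
    """Return True when the release is at or past the assumed fix version."""
    return parse_release(version) >= ASSUMED_FIX_VERSION


def installed_torch_version():
    """Return torch.__version__, or None if torch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.__version__
```

With something like this, collect_gpu_metrics() could call `can_use_torch_cuda(installed_torch_version())` (after handling `None`) and stay on the temporary solution whenever it returns `False`. Nightly builds would need a closer look, since a pre-2.6.0 nightly may or may not contain the fix.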