Check the state of the temporary solution for collecting GPU metrics

We aim to use the torch.cuda interface in ts.metrics.system_metrics.collect_gpu_metrics() for amdsmi-related calls, but a bug in torch.cuda currently prevents this.

A fix has been merged upstream but has not been released yet:

https://github.com/pytorch/pytorch/pull/140259

UPDATE: we may have to wait until 29.1.2025, when the release of 2.6.0 is scheduled; see PyTorch Release 2.6.0 | Call for features.
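
Until then, one way to keep the temporary solution in place and switch over automatically is a version gate. The sketch below is only an illustration: it assumes the fix first ships in PyTorch 2.6.0 (which still needs to be confirmed once the release is out), and `FIXED_IN`, `parse_version` and `can_use_torch_cuda` are hypothetical names, not part of ts or torch.

```python
import re

# Assumption: the upstream fix first ships in PyTorch 2.6.0.
FIXED_IN = (2, 6, 0)


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from strings such as '2.6.0+rocm6.2'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def can_use_torch_cuda(version: str) -> bool:
    """Return True when the version is at or past the assumed fix release.

    Note: pre-release builds such as '2.6.0.dev...' are treated as 2.6.0
    here, even though an early nightly may not contain the fix.
    """
    return parse_version(version) >= FIXED_IN
```

Inside collect_gpu_metrics() this could be called as `can_use_torch_cuda(torch.__version__)`, falling back to the current temporary solution when it returns False.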

nod-ai/serve, issue #17