pytorch/xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

xla-smi similar to nvidia-smi/nvitop or rocm-smi #7844

Open · radna0 opened this issue 2 months ago

radna0 commented 2 months ago

🚀 Feature

A CLI tool to show XLA device usage, for example:

TPU device: xla:0 memory: 0.0 / 16.62 GB
TPU device: xla:1 memory: 0.0 / 16.62 GB
TPU device: xla:2 memory: 0.0 / 16.62 GB
TPU device: xla:3 memory: 0.0 / 16.62 GB
TPU device: xla:4 memory: 0.0 / 16.62 GB
TPU device: xla:5 memory: 0.0 / 16.62 GB
TPU device: xla:6 memory: 0.0 / 16.62 GB
TPU device: xla:7 memory: 0.0 / 16.62 GB
Total TPU memory: 0.0 / 132.96 GB
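A readout like the above can already be approximated in-process. Below is a minimal sketch, assuming torch_xla's `xm.get_xla_supported_devices()` and `xm.get_memory_info()` APIs; the `kb_free`/`kb_total` dict keys are from older torch_xla releases and may differ in newer ones:

```python
# Minimal per-device memory readout sketch for torch_xla.
# Assumes older-style get_memory_info() keys (kb_free/kb_total);
# newer releases may expose different key names.
import torch_xla.core.xla_model as xm

def print_device_memory():
    total_used_gb = 0.0
    total_cap_gb = 0.0
    for device in xm.get_xla_supported_devices():
        info = xm.get_memory_info(device)
        cap_gb = info["kb_total"] / 1e6                      # KB -> GB
        used_gb = (info["kb_total"] - info["kb_free"]) / 1e6  # KB -> GB
        total_used_gb += used_gb
        total_cap_gb += cap_gb
        print(f"TPU device: {device} memory: {used_gb:.1f} / {cap_gb:.2f} GB")
    print(f"Total TPU memory: {total_used_gb:.1f} / {total_cap_gb:.2f} GB")

if __name__ == "__main__":
    print_device_memory()
```

Note that this runs inside the worker process itself, which is exactly the limitation described under Motivation below: a separate monitoring script cannot attach to devices already claimed by a worker.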

Motivation

No two workers can use the same XLA device at the same time, so there is no way to run one script to monitor the devices while another script uses them.

Pitch

There is currently no way to monitor TPU/XLA device usage.

Alternatives

nvidia-smi, nvitop, rocm-smi, htop, jax-smi

Additional context

JackCaoG commented 2 months ago

@will-cromar we already have the tpu-info CLI tool.
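For context, tpu-info is published on PyPI, so on a TPU VM it can typically be installed with `pip install tpu-info` and invoked as `tpu-info`; it is designed to report per-chip TPU memory usage alongside a running workload, rather than claiming the devices itself.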