stanford-crfm / levanter

Legible, Scalable, Reproducible Foundation Models with Named Tensors and Jax
https://levanter.readthedocs.io/en/latest/
Apache License 2.0

Add TPU Metrics to Weights and Biases Logging #544

Open Helw150 opened 3 months ago

Helw150 commented 3 months ago

For GPUs, I (and I think most folks) am used to debugging memory usage and performance with nvidia-smi. TPUs don't have a great equivalent for this. Right now, we log a bunch of other system metrics to Weights and Biases, but we don't log TPU metrics!

It looks like there are hookups for TPU monitoring, so it would be a nice QOL improvement to log these metrics to Weights and Biases whenever a job is kicked off in a TPU environment.

https://cloud.google.com/tpu/docs/troubleshooting/tpu-vm-monitoring
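
A minimal sketch of what this could look like, assuming JAX's per-device `memory_stats()` is an acceptable first source of numbers. It only covers device memory, not the Cloud Monitoring metrics described on the page above. The helper names `collect_device_memory_metrics` and `log_device_memory` and the `tpu/...` metric keys are hypothetical, not existing Levanter code, and the exact keys returned by `memory_stats()` may differ by backend (it can also return `None`):

```python
from typing import Any, Callable, Dict, Iterable, Optional


def collect_device_memory_metrics(devices: Iterable[Any], prefix: str = "tpu") -> Dict[str, float]:
    """Flatten per-device memory stats into one dict for a metrics logger.

    `devices` is anything exposing `.id` and `.memory_stats()`, e.g. the
    objects returned by `jax.local_devices()`.
    """
    metrics: Dict[str, float] = {}
    for device in devices:
        # memory_stats() may return None on backends that do not report stats.
        stats = device.memory_stats() or {}
        # Assumed key names; only the ones actually present are logged.
        for key in ("bytes_in_use", "peak_bytes_in_use", "bytes_limit"):
            if key in stats:
                metrics[f"{prefix}/device_{device.id}/{key}"] = float(stats[key])
        in_use = stats.get("bytes_in_use")
        limit = stats.get("bytes_limit")
        if in_use is not None and limit:
            metrics[f"{prefix}/device_{device.id}/memory_utilization"] = in_use / limit
    return metrics


def log_device_memory(step: int, log_fn: Optional[Callable[..., None]] = None) -> Dict[str, float]:
    """Collect stats from the local JAX devices and send them to W&B (or `log_fn`)."""
    import jax  # imported lazily so the helper above works without JAX installed

    metrics = collect_device_memory_metrics(jax.local_devices())
    if log_fn is None:
        import wandb  # assumes wandb.init() has already been called

        log_fn = wandb.log
    if metrics:
        log_fn(metrics, step=step)
    return metrics
```

Calling `log_device_memory(step)` every N training steps would give a rough nvidia-smi-style memory view in the W&B dashboard. Utilization or duty-cycle numbers would need a different source.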

Helw150 commented 3 months ago

Maybe turning on Profiler is supposed to do this? For some reason, in my current runs nothing gets written to the Profiler log directory even with Profiler on (admittedly, the jobs are dying suddenly).

dlwh commented 3 months ago

seems like a good idea. probably not gonna do it imminently, but happy to support someone who wants to do it!

Helw150 commented 3 months ago

Yeah, this was more a note for myself to do at some point!