tensorflow / tpu

Reference models and tools for Cloud TPUs.
https://cloud.google.com/tpu/
Apache License 2.0

How can I monitor the utilization of TPU on Colab #928

Open GarrettLee opened 3 years ago

GarrettLee commented 3 years ago

Is there something like nvidia-smi?

I have tried https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/profiling_tpus_in_colab.ipynb?authuser=2#scrollTo=mNA__vniyY8e, but the TensorBoard page shows "No dashboards are active for the current data set."
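For context, the Profile dashboard only appears once trace data has actually been written to the log directory. A minimal sketch of capturing a trace programmatically with tf.profiler, which is the flow that notebook automates (the logdir path, trace duration, and profiler port here are illustrative assumptions):

import tensorflow as tf

# Resolve and initialize the Colab TPU.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# Ask the TPU worker's profiler service for a short trace; 8466 is
# the conventional Cloud TPU profiler port (assumption). The trace
# is written to logdir, which TensorBoard's Profile tab then reads.
host = resolver.master().replace('grpc://', '').rsplit(':', 1)[0]
tf.profiler.experimental.client.trace(
    service_addr=f'{host}:8466',
    logdir='/tmp/tb_logs',  # placeholder log directory
    duration_ms=2000)

Run this while a workload is executing on the TPU, otherwise the captured trace will be idle.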

mavillan commented 3 years ago

For those who came here with the same problem: you can install https://pypi.org/project/cloud-tpu-profiler/ and then run this in the Colab terminal:

> capture_tpu_profile --service_addr xx.xx.xx.xx:port --monitoring_level 2

The address is displayed when you initialize the TPU service.

aLLUPS commented 2 years ago

You can also view the address as follows:

import tensorflow as tf

tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
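
Building on that, a small sketch (assumptions: a standard Colab TPU runtime, and 8466 as the profiler port) that feeds the resolved worker address straight into the capture_tpu_profile command from the comment above:

import subprocess

import tensorflow as tf

tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
worker = tpu.cluster_spec().as_dict()['worker'][0]  # e.g. '10.0.0.2:8470'
host = worker.split(':')[0]

# Monitoring level 2 polls utilization continuously; the profiler
# service listens on its own port, assumed to be 8466 here.
subprocess.run(['capture_tpu_profile',
                '--service_addr', f'{host}:8466',
                '--monitoring_level', '2'])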

Also, you can emulate a terminal in Colab by following the article here.

sagelywizard commented 3 months ago

We recently added a CLI for checking this! Check out the tpu-info utility on our TPU runtime.

The output looks like this:

TPU Chips                                   
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┓
┃ Device      ┃ Type        ┃ Cores ┃ PID  ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━┩
│ /dev/accel0 │ TPU v2 chip │ 2     │ 1344 │
│ /dev/accel1 │ TPU v2 chip │ 2     │ 1344 │
│ /dev/accel2 │ TPU v2 chip │ 2     │ 1344 │
│ /dev/accel3 │ TPU v2 chip │ 2     │ 1344 │
└─────────────┴─────────────┴───────┴──────┘
Connected to libtpu at grpc://localhost:8431...
TPU Chip Utilization                          
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Core ID ┃ Memory usage        ┃ Duty cycle ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ 0       │ 0.00 GiB / 7.48 GiB │      0.00% │
│ 1       │ 0.00 GiB / 7.48 GiB │      0.00% │
│ 2       │ 0.00 GiB / 7.48 GiB │      0.00% │
│ 3       │ 0.00 GiB / 7.48 GiB │      0.00% │
│ 4       │ 0.00 GiB / 7.48 GiB │      0.00% │
│ 5       │ 0.00 GiB / 7.48 GiB │      0.00% │
│ 6       │ 0.00 GiB / 7.48 GiB │      0.00% │
│ 7       │ 0.00 GiB / 7.48 GiB │      0.00% │
└─────────┴─────────────────────┴────────────┘
giuliano-97 commented 2 months ago

> We recently added a CLI for checking this! Check out the tpu-info utility on our TPU runtime.

Hi @sagelywizard, is there a way to install this on a Cloud TPU VM?

sagelywizard commented 2 months ago

Yep. pip install git+https://github.com/google/cloud-accelerator-diagnostics/#subdirectory=tpu_info
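
If the install succeeds, the entry point should then be available as a plain tpu-info command on the VM; a trivial check from Python (the subprocess wrapper is just for illustration):

import subprocess

# Assumption: the pip install above put the `tpu-info` entry point
# on PATH; this simply shells out to it and prints its tables.
subprocess.run(['tpu-info'], check=True)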

giuliano-97 commented 2 months ago

> Yep. pip install git+https://github.com/google/cloud-accelerator-diagnostics/#subdirectory=tpu_info

awesome, thanks