pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

TPU V4-32 Failed to get global TPU topology. #7939

Open radna0 opened 2 months ago

radna0 commented 2 months ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

1. Install the nightly wheels:

   ```bash
   pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
   pip install 'torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.5.0.dev-cp310-cp310-linux_x86_64.whl' -f https://storage.googleapis.com/libtpu-releases/index.html
   ```

2. Save the following as `mem_check.py`:

   ```python
   import torch
   import torch_xla
   import torch_xla.core.xla_model as xm

   devices = xm.get_xla_supported_devices()
   print(f"Devices: {devices}")

   total = {0: 0, 1: 0}
   for device in devices:
       mem = round(xm.get_memory_info(device)["bytes_limit"] / 1e9, 2)
       total[1] += mem
       print(f'Total TPU device: {device} memory: {mem} GB')

   print(f"Total TPU memory: {total[0]} / {total[1]} GB")

   for device in devices:
       mem = round(xm.get_memory_info(device)["bytes_limit"] / 1e9, 2)
       t = torch.randn(torch.randint(1, 8, (1,)), 4, 144, 720, 1280).to(device)
       mem_used = round(xm.get_memory_info(device)["bytes_used"] / 1e9, 2)
       total[0] += mem_used
       print(f'Total TPU device: {device} memory: {mem_used} / {mem} GB')
       xm.mark_step()

   print(f"Total TPU memory: {total[0]} / {total[1]} GB")
   ```

3. Run the script:

   ```bash
   python mem_check.py
   ```

4. Error:

   ```
   Traceback (most recent call last):
     File "/home/kojoe/mem_check.py", line 6, in <module>
       devices = xm.get_xla_supported_devices()
     File "/home/kojoe/.local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 93, in get_xla_supported_devices
       devices = torch_xla._XLAC._xla_get_devices()
   RuntimeError: Bad StatusOr access: INTERNAL: Failed to get global TPU topology.
   ```
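For what it's worth, the failure happens on the first call that touches the XLA runtime, so a one-line check like the sketch below (not part of the original script) should hit the same error before any memory is queried:

```bash
# Added sketch, not part of the original report: isolate the failing call.
# When run the same way as mem_check.py, this should raise the same
# "Failed to get global TPU topology" error.
python3 -c "import torch_xla.core.xla_model as xm; print(xm.get_xla_supported_devices())"
```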




## Expected behavior

The script should report the memory limit and memory usage of each TPU device.

## Environment

 - Reproducible on XLA backend: TPU
 - torch_xla version: NIGHTLY 2.5

## Additional context

JackCaoG commented 2 months ago

@ManfeiBai can you take a look?

ManfeiBai commented 2 months ago

thanks, sure, will take a look

ManfeiBai commented 2 months ago

Hi, I tried this on a v4-32 locally. Since a v4-32 is a multi-host device, I ran the commands on all workers, as in: https://gist.github.com/ManfeiBai/3a2ac89435dbb7a9914e34d24b8449ba
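(In case the gist is unavailable, the general shape of those commands is roughly the sketch below; the TPU name, zone, and project are placeholders, not the values actually used.)

```bash
# Rough sketch: run the install and the repro on every worker of the v4-32 at once.
# "my-v4-32", "us-central2-b", and "my-project" are placeholder values.
gcloud compute tpus tpu-vm ssh my-v4-32 \
  --zone=us-central2-b --project=my-project --worker=all \
  --command="pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu"

# (install the torch_xla[tpu] nightly wheel the same way, then run the script)
gcloud compute tpus tpu-vm ssh my-v4-32 \
  --zone=us-central2-b --project=my-project --worker=all \
  --command="python3 mem_check.py"
```

Launching the script on only one worker of a multi-host slice leaves the runtime unable to see the full topology, which is consistent with the error above.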

The repro script above finished and printed: https://gist.github.com/ManfeiBai/cb23bb15850c8167320e910cf4b3f95c

Hi @radna0, would you mind trying commands like https://gist.github.com/ManfeiBai/3a2ac89435dbb7a9914e34d24b8449ba on your local v4-32 again? Or would you mind sharing your commands so that I can try to reproduce locally as well? Please let us know if there are any updates.

radna0 commented 1 month ago

Here's what I got running the script, @ManfeiBai: tpu_v4_logs.txt