pytorch / xla

hbm OOM #1966

Closed · shizhediao closed this issue 4 years ago

shizhediao commented 4 years ago

❓ Questions and Help

Hi, thanks for your great work! I have been really frustrated recently because I am trying to move a GPU-based model to TPU to get a speedup. The model is BigGAN implemented in PyTorch. I use a v3-8 with xla-1.5 (I also tried nightly), but I keep hitting a really strange HBM OOM. I think my machine has 60GB of memory, but while monitoring the training process I found that it prints a very long error message as soon as it exceeds 16GB of memory.

To give you an overview of my machine, I ran `free -h` and got:

```
              total        used        free      shared  buff/cache   available
Mem:            58G        9.2G         26G         12M         23G         49G
Swap:            0B          0B          0B
```

When I typed `cat /proc/cpuinfo | grep processor | wc -l` I got 16.

Could you help me figure out the problem? Thanks so much!

[screenshot: the HBM OOM error message]

shizhediao commented 4 years ago

By the way, my dataset is a tiny subset of ImageNet (just 2000 images). I can train the model successfully when I set batchsize=1, but when I set batchsize=8 or 16 it fails with an HBM OOM. I use a v3-8, and the machine type is n1-standard-16.
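A standard workaround when a single step's activations overflow HBM is gradient accumulation: run several batch-size-1 micro-steps and only apply the optimizer every few of them. A minimal sketch, assuming the public torch_xla API; the model, data, and hyperparameters below are illustrative placeholders, not the thread's actual BigGAN setup:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = nn.Linear(128, 10).to(device)            # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

ACCUM_STEPS = 8  # emulate an effective batch of 8 with batch-size-1 micro-steps

for step in range(64):
    data = torch.randn(1, 128, device=device)           # dummy micro-batch of 1
    target = torch.randint(0, 10, (1,), device=device)
    loss = loss_fn(model(data), target) / ACCUM_STEPS   # average over micro-steps
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        xm.optimizer_step(optimizer, barrier=True)      # apply update, flush XLA graph
        optimizer.zero_grad()
```

This keeps only one micro-batch's activations live at a time, at the cost of more steps per parameter update.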

dlibenzi commented 4 years ago

There is nothing much "strange" about an HBM OOM. HBM is device memory, so host memory does not count. It is also impossible to debug from screenshots. If you create a Colab following these guidelines, we could take a look:

https://github.com/pytorch/xla/blob/master/contrib/colab/issue-report.ipynb

But, from what I understand, your model might just be too big.
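As a sketch of one way to capture extra debugging information for such a report, torch_xla exposes a metrics report through its `torch_xla.debug.metrics` module:

```python
import torch_xla.debug.metrics as met

# After running a few training steps on the XLA device, this dumps the
# counters and timings (compilations, host-device transfers, executions)
# that help diagnose memory and performance problems.
print(met.metrics_report())
```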

shizhediao commented 4 years ago

> There is nothing much "strange" about an HBM OOM. HBM is device memory, so host memory does not count. It is also impossible to debug from screenshots. If you create a Colab following these guidelines, we could take a look:
>
> https://github.com/pytorch/xla/blob/master/contrib/colab/issue-report.ipynb
>
> But, from what I understand, your model might just be too big.

Thanks for your reply. Sorry, could you tell me the difference between device and host memory? Currently I just have one VM and one corresponding TPU machine. I will follow the guidelines and create a Colab. Thanks again!

dlibenzi commented 4 years ago

Device memory (what we refer to as HBM) is the memory on the TPU device itself. In a TPU v3-8 there are 8 cores, each with 16GB of HBM. That does not mean there is 128GB available in a single memory space, as each device can only access the memory attached to it. Host memory is the memory of your Cloud VM, where the OS and userspace programs live.
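To make the distinction concrete, a minimal sketch, assuming the standard torch_xla API:

```python
import torch
import torch_xla.core.xla_model as xm

host_tensor = torch.randn(1024, 1024)   # allocated in the Cloud VM's RAM (the ~58G above)

device = xm.xla_device()                 # handle to one TPU core
device_tensor = host_tensor.to(device)   # copied into that core's 16GB of HBM

# Under xmp.spawn(..., nprocs=8) each process drives its own core with its
# own 16GB of HBM; the eight cores do not pool into one 128GB space.
```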

shizhediao commented 4 years ago

> Device memory (what we refer to as HBM) is the memory on the TPU device itself. In a TPU v3-8 there are 8 cores, each with 16GB of HBM. That does not mean there is 128GB available in a single memory space, as each device can only access the memory attached to it. Host memory is the memory of your Cloud VM, where the OS and userspace programs live.

Got it. Thanks so much!

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.