By the way, my dataset is a tiny subset of ImageNet (just 2,000 images). I can train my model successfully with batch size 1, but with batch size 8 or 16 it fails with an HBM OOM. I am using a v3-8, and the VM machine type is n1-standard-16.
There is nothing much "strange" about an HBM OOM. HBM is device memory, so host memory does not count toward it. It is also impossible to debug from screenshots. If you create a Colab following these guidelines, we can take a look:
https://github.com/pytorch/xla/blob/master/contrib/colab/issue-report.ipynb
But, from what I understand, your model might just be too big.
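For reference, a minimal repro along those lines might look like the sketch below. The `Linear` model and tensor shapes are placeholders, not your actual BigGAN; the idea is just to run one step on the batch size that OOMs and print torch_xla's debug metrics report:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
model = torch.nn.Linear(1024, 1024).to(device)  # placeholder for the real model
x = torch.randn(8, 1024, device=device)         # the batch size that OOMs for you
loss = model(x).sum()
loss.backward()
xm.mark_step()                # force execution of the lazily-built graph
print(met.metrics_report())   # compile/transfer counters to attach to the report
```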
Thanks for your reply. Sorry, could you tell me the difference between the device and the host? Currently I have one VM and one corresponding TPU. I will follow the guidelines and create a Colab. Thanks again!
Device memory (what we refer to as HBM) is the memory on the TPU device. In TPU v3-8 there are 8 cores each with 16GB of HBM. That does not mean there are 128GB available in a single memory space, as devices can only access memory that is attached to them. Host memory is the memory of your Cloud VM, where the OS and userspace programs live.
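To make that concrete, here is a rough sketch of the usual multi-process setup (again with a stand-in model, not your BigGAN): with `xmp.spawn`, each of the 8 processes drives one core, so each core must fit its own replica of the model plus the activations for its per-core batch inside that core's 16GB:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # One process per core; each core has its own 16GB of HBM.
    device = xm.xla_device()
    model = torch.nn.Linear(1024, 1024).to(device)  # per-core model replica (placeholder)
    x = torch.randn(8, 1024, device=device)         # per-core batch of 8 -> global batch of 64
    loss = model(x).sum()
    loss.backward()
    xm.mark_step()

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=(), nprocs=8)
```

So what matters for the OOM is what each core holds, not host RAM: `free -h` on the VM tells you nothing about how full a core's 16GB of HBM is.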
Got it. Thanks so much!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
❓ Questions and Help
Hi, thanks for your great work! I have been quite frustrated recently trying to move a GPU-based model to TPU for a speedup. The model is BigGAN, implemented in PyTorch. I am using a v3-8 with xla-1.5 (I also tried the nightly build), but I keep hitting a really strange HBM OOM. I believe my machine has 60GB of memory, yet while monitoring training I get a very long error message as soon as usage exceeds 16GB.
To give you an overview of my machine, I ran:
```
$ free -h
              total        used        free      shared  buff/cache   available
Mem:            58G        9.2G         26G         12M         23G         49G
Swap:            0B          0B          0B
```
When I run this command:

```
cat /proc/cpuinfo | grep processor | wc -l
```

I get 16. Could you help me figure out the problem? Thanks so much!