tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

TPU HBM OOM #1807

Open wppply opened 4 years ago

wppply commented 4 years ago

Hi. I am trying to use a TPU v2-8 to train a query classifier, but I have run into a memory issue.

Officially, a TPU v2-8 is claimed to have 64 GB of HBM, yet I keep getting the error below when I follow this tutorial; it cannot handle more than 8 GB.

INFO:tensorflow:Error recorded from training_loop: Compilation failure: Ran out of memory in memory space hbm. Used 8.83G of 8.00G hbm. Exceeded hbm capacity by 848.88M.

Total hbm usage >= 8.83G:
    reserved        528.00M
    program           8.25G
    arguments        64.32M (99.9% utilization)

Output size 64.32M (99.9% utilization); shares 64.25M with arguments.

Program hbm requirement 8.25G:
    reserved           4.0K
    global            65.0K
    HLO temp          8.25G (100.0% utilization, 0.0% fragmentation (1.01M))

  Largest program allocations in hbm:

  1. Size: 4.00G
     Operator: op_name="XLA_Args"
     Shape: bf16[256,2048,4096]{2,1,0}
     Unpadded size: 4.00G
     XLA label: %arg_tuple.1996.1402 = (s32[], s32[], f32[], f32[4,1024]{1,0}, bf16[4,1024]{1,0}, f32[4,1024]{1,0}, bf16[4,1024]{1,0}, s32[4]{0}, s32[], s32[], f32[4,1024]{1,0}, f32[], bf16[], bf16[], s32[], bf16[2048,4096]{1,0}, bf16[4096]{0}, bf16[2048,4096]{1,0}, bf16[...
     Allocation type: HLO temp
...

Here is the TPU utilization report; the utilization is quite low as well.

  TPU type: TPU v2
  Number of TPU cores: 8 (Replica count = 8, num cores per replica = 1)
  TPU idle time (lower is better): 0.009%
  Utilization of TPU Matrix Units (higher is better): 32.1%
  Step time: 58.6ms (avg), 58.4ms (min), 58.9ms (max)
  Infeed percentage: 0.010% (avg), 0.009% (min), 0.010% (max)

I thought the TPU would split the batch equally across the cores, but it seems not to; it looks like only a single core is being used. When I run the same code on a single Nvidia T4, there is nothing wrong with it. So what should I add to the code, or which CLI option should I pass, to leverage all 8 TPU cores instead of what appears to be a single one? Thanks.

juneoh commented 4 years ago

When training tensor2tensor on TPU, the actual global batch size is automatically calculated as batch_size * tpu_config.num_shards. Also, the memory usage shown in the error accounts for only a single core out of the 8 replicas. Hence, 'Number of TPU cores: 8' means that you are already effectively using all 8 cores, with 8 times the batch size you specified, and the total HBM usage across the slice is 8.83 GB × 8 = 70.64 GB.
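For concreteness, here is a minimal sketch of that accounting. Only the 8 shards and the 8.83 GB per-core figure come from this thread; per_core_batch_size is a hypothetical placeholder for whatever --hparams batch_size you actually passed.

# Minimal sketch of the batch-size and HBM accounting described above.
per_core_batch_size = 1024            # placeholder: your --hparams batch_size
num_shards = 8                        # tpu_config.num_shards on a v2-8

global_batch_size = per_core_batch_size * num_shards   # what actually trains per step
per_core_hbm_gb = 8.83                # what the XLA OOM message reports (one core)
total_hbm_gb = per_core_hbm_gb * num_shards            # usage across the whole slice

print(global_batch_size, round(total_hbm_gb, 2))       # 8192 70.64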

As for the 32.1% MXU utilization, that is similar to what I've seen running t2t on TPU. Although it may seem low, you'll find that its speed is still much faster than GPU. You can use the Cloud TPU tools to grab a deeper, op-by-op profile.
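If you happen to be on a recent TensorFlow runtime (2.2+), one way to capture such a profile programmatically is the profiler client; this is only a sketch, and the TPU worker address and GCS log directory below are placeholders, not values from this thread. (On TF 1.x setups the capture_tpu_profile command-line tool from the cloud-tpu-profiler package serves the same purpose.)

# Hedged sketch: capture a short on-device profile from a Cloud TPU worker.
# Requires TF 2.2+ for tf.profiler.experimental.client; address and logdir
# are placeholders.
import tensorflow as tf

TPU_WORKER = "grpc://10.0.0.2:8466"      # placeholder: TPU worker IP + profiler port
LOGDIR = "gs://my-bucket/t2t-profile"    # placeholder: GCS dir TensorBoard can read

# Trace ~2 seconds of execution; open the result in TensorBoard's Profile tab
# to see the op-by-op breakdown mentioned above.
tf.profiler.experimental.client.trace(TPU_WORKER, LOGDIR, duration_ms=2000)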