google-research / big_vision

Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Apache License 2.0

TPU utilization could be improved further? #3

Closed — sayakpaul closed this issue 2 years ago

sayakpaul commented 2 years ago

Training details are in https://github.com/google-research/big_vision/issues/2

I think the TPU utilization is a bit lower than expected:

[Screenshot: TPU utilization dashboard, 2022-05-11]

Is this expected?

I understand there might be other factors, such as network access, that contribute to this, but I wanted to check.

akolesnikoff commented 2 years ago

TPU utilization should be close to 100%. I think your dashboard is showing something else.

My guess is that it shows the percentage of time the TPUs spend on FLOP-heavy operations, like matrix multiplications. The rest is various data reshapes, weight synchronization, and so on. IIUC, it is hard to do substantially better than what we have now.
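The metric described above (fraction of peak FLOP/s actually sustained, often called MXU utilization) can be sketched with a small calculation. This is a minimal illustration, not code from big_vision; the function name and all numbers (ViT-B/16 forward-pass cost of ~17.6 GFLOPs per image with backward ≈ 2× forward, batch 512, TPU v3-8 peak of ~420 bf16 TFLOP/s, 0.1 s step time) are illustrative assumptions.

```python
# Sketch: estimating the fraction of peak FLOP/s a training step achieves.
# All numbers are illustrative assumptions, not measurements from this issue.

def mxu_utilization(flops_per_step: float, step_time_s: float,
                    peak_flops: float) -> float:
    """Fraction of peak FLOP/s actually sustained during one training step."""
    achieved = flops_per_step / step_time_s  # FLOP/s actually sustained
    return achieved / peak_flops

# Hypothetical numbers: ViT-B/16 forward+backward is roughly 3 * 17.6 GFLOPs
# per 224x224 image; batch size 512; TPU v3-8 peak ~420 TFLOP/s (bf16).
flops_per_step = 3 * 17.6e9 * 512
util = mxu_utilization(flops_per_step, step_time_s=0.1, peak_flops=420e12)
print(f"{util:.0%}")  # prints "64%" for these assumed numbers
```

Time spent on reshapes, collectives, and host transfers lowers this number even when the accelerator is never idle, which is why a dashboard reporting it can sit well below 100%.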

sayakpaul commented 2 years ago

Here's what I am doing.

[Screenshot: dashboard metric selection]

I'm probably selecting the wrong metric. But the clues you've given are enough for now.