sayakpaul closed this 2 years ago
TPU utilization should be close to 100%. I think your dashboard is showing something else.
My guess is that it shows the percentage of time the TPUs spend on FLOP-heavy operations, like matrix multiplications. The rest goes to various data reshapes, weight synchronization and so on. IIUC, it is hard to do substantially better than what we have now.
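To illustrate the distinction, here is a minimal sketch. The category names and timings are made up for illustration; they are not measurements from this run, and this is not necessarily how the dashboard computes its number. It only shows how a device can be busy for the whole step while the FLOP-heavy share stays well below 100%.

```python
# Hypothetical per-step time breakdown in milliseconds.
# These values are invented for illustration only.
step_time_ms = {
    "matmul_and_conv": 60.0,    # FLOP-heavy work
    "reshape_transpose": 15.0,  # data reshapes
    "weight_sync": 20.0,        # weight synchronization across devices
    "other": 5.0,
}

# Assumption: only this category counts as FLOP-heavy.
FLOP_HEAVY = {"matmul_and_conv"}


def flop_heavy_fraction(breakdown, flop_heavy=FLOP_HEAVY):
    """Return the share of total step time spent in FLOP-heavy categories."""
    total = sum(breakdown.values())
    if total <= 0:
        raise ValueError("breakdown must contain positive total time")
    heavy = sum(t for name, t in breakdown.items() if name in flop_heavy)
    return heavy / total


print(f"FLOP-heavy share of step time: {flop_heavy_fraction(step_time_ms):.0%}")
```

In this made-up breakdown the TPU is occupied for the entire step, yet a metric that only counts matrix-multiplication time would report a much lower figure.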
Here's what I am doing.
I'm probably wrong in selecting that. But the clues you've suggested are enough for now.
Training details are in https://github.com/google-research/big_vision/issues/2
I think the TPU utilization is a bit lower than expected:
Is this expected?
I understand there might be other factors, such as network access, that contribute to this, but I wanted to know.