google-research / big_vision

Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Apache License 2.0

Running out of RAM on cloud TPU when reading data from Cloud Storage #36

Closed: izmailovpavel closed this issue 9 months ago

izmailovpavel commented 1 year ago

Hi! I am trying to run training with the vit_s16_i1k.py config on a TPU-v3-8 machine. I put the data in a Google Cloud Storage bucket, and I am running the following command:

TFDS_DATA_DIR=gs://bucket-name/ python3 -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir workdirs/i1k_training_`date '+%m-%d_%H%M'`

The training runs for a few iterations and then fails with a `Killed` message. When I watch htop, the memory used by the process grows until it reaches the full 335G available, at which point the process crashes.

I have been able to work around this issue by creating a data disk, mounting it on the TPU VM and putting the data there. In that case the same process only uses 205G of RAM and runs normally.
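For reference, the disk-based workaround can be sketched roughly as follows. Disk name, TPU VM name, zone, and device path are all placeholders for illustration, and `attach-disk` lives in the `gcloud alpha` surface at the time of writing:

```shell
# Create a persistent disk and attach it to the TPU VM
# (names and zone are hypothetical -- substitute your own).
gcloud compute disks create tpu-data-disk --size=300GB --zone=us-central1-a
gcloud alpha compute tpus tpu-vm attach-disk my-tpu-vm \
    --zone=us-central1-a --disk=tpu-data-disk --mode=read-write

# On the TPU VM: format (first time only), mount, and copy the dataset over.
sudo mkfs.ext4 -F /dev/sdb        # assumes the disk appears as /dev/sdb
sudo mkdir -p /mnt/disks/data
sudo mount /dev/sdb /mnt/disks/data
gsutil -m cp -r gs://bucket-name/* /mnt/disks/data/

# Point TFDS at the local disk instead of GCS.
TFDS_DATA_DIR=/mnt/disks/data python3 -m big_vision.train \
    --config big_vision/configs/vit_s16_i1k.py \
    --workdir workdirs/i1k_training
```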

ajayansaroj17 commented 1 year ago

Try reducing your batch size: your workaround of putting the data on a disk mounted on the TPU VM seems to have alleviated the issue by reducing memory usage, so lowering memory pressure further may help.
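If a smaller batch size is worth trying, big_vision configs are `ml_collections` ConfigDicts, so fields can usually be overridden from the command line without editing the config file. The field name `config.batch_size` here is an assumption based on this config; note that a smaller batch typically also calls for rescaling the learning rate:

```shell
# Hypothetical override: halve the batch size from the command line.
TFDS_DATA_DIR=gs://bucket-name/ python3 -m big_vision.train \
    --config big_vision/configs/vit_s16_i1k.py \
    --config.batch_size=512 \
    --workdir workdirs/i1k_training
```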

lucasb-eyer commented 11 months ago

Only saw this now. Knowing Pavel, I think it's not relevant to him anymore, but for reference, here are two more options, at a slight cost in speed:

  1. Set cache_raw to False in the config: https://github.com/google-research/big_vision/blob/c01707f710170baf162dc217db7f1f044fc5be9c/big_vision/configs/vit_s16_i1k.py#L48
  2. For all evaluators, in their config, set cache_final to False: https://github.com/google-research/big_vision/blob/c01707f710170baf162dc217db7f1f044fc5be9c/big_vision/evaluators/classification.py#L59
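Concretely, the two options above could look like this inside the config file. This is a sketch: the `cache_raw`/`cache_final` field names come from the lines linked above, but the evaluator layout under `config.evals` is an assumption:

```python
# Sketch of edits to big_vision/configs/vit_s16_i1k.py.

# Option 1: don't cache raw decoded examples in host RAM; stream from GCS instead.
config.cache_raw = False

# Option 2: disable caching of the final preprocessed split for every evaluator
# (assumes evaluators are entries under config.evals).
for eval_name in config.evals:
    config.evals[eval_name].cache_final = False
```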