google-research / big_vision

Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Apache License 2.0

Running out of RAM on cloud TPU when reading data from Cloud Storage #36

Closed: izmailovpavel closed this issue 9 months ago

izmailovpavel commented 1 year ago

Hi! I am trying to run training with the vit_s16_i1k.py config on a TPU-v3-8 machine. I put the data in a Google Cloud Storage bucket, and I am running the following command:

TFDS_DATA_DIR=gs://bucket-name/ python3 -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir workdirs/i1k_training_`date '+%m-%d_%H%M'`

The training runs for a few iterations and then fails with a `Killed` message. When I watch htop, the memory used by the process grows until it reaches the full 335G available, at which point the process crashes.

I have been able to work around this issue by creating a data disk, mounting it on the TPU VM and putting the data there. In that case the same process only uses 205G of RAM and runs normally.
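For reference, the disk-based workaround can be sketched roughly as follows. Disk name, TPU VM name, zone, and device path are all placeholders for illustration, and `attach-disk` lives in the `gcloud alpha` surface at the time of writing:

```shell
# Create a persistent disk and attach it to the TPU VM
# (names and zone are hypothetical -- substitute your own).
gcloud compute disks create tpu-data-disk --size=300GB --zone=us-central1-a
gcloud alpha compute tpus tpu-vm attach-disk my-tpu-vm \
    --zone=us-central1-a --disk=tpu-data-disk --mode=read-write

# On the TPU VM: format (first time only), mount, and copy the dataset over.
sudo mkfs.ext4 -F /dev/sdb        # assumes the disk appears as /dev/sdb
sudo mkdir -p /mnt/disks/data
sudo mount /dev/sdb /mnt/disks/data
gsutil -m cp -r gs://bucket-name/* /mnt/disks/data/

# Point TFDS at the local disk instead of GCS.
TFDS_DATA_DIR=/mnt/disks/data python3 -m big_vision.train \
    --config big_vision/configs/vit_s16_i1k.py \
    --workdir workdirs/i1k_training
```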

ajayansaroj17 commented 1 year ago

Try reducing your batch size: your workaround of putting the data on a disk mounted on the TPU VM seems to have alleviated the issue by reducing memory usage, so lowering memory pressure further may help.
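If a smaller batch size is worth trying, big_vision configs are `ml_collections` ConfigDicts, so fields can usually be overridden from the command line without editing the config file. The field name `config.batch_size` here is an assumption based on this config; note that a smaller batch typically also calls for rescaling the learning rate:

```shell
# Hypothetical override: halve the batch size from the command line.
TFDS_DATA_DIR=gs://bucket-name/ python3 -m big_vision.train \
    --config big_vision/configs/vit_s16_i1k.py \
    --config.batch_size=512 \
    --workdir workdirs/i1k_training
```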

lucasb-eyer commented 11 months ago

Only saw this now. Knowing Pavel, I think it's not relevant to him anymore, but for reference, here are two more options, at a slight cost in speed:

  1. Set cache_raw to False in the config: https://github.com/google-research/big_vision/blob/c01707f710170baf162dc217db7f1f044fc5be9c/big_vision/configs/vit_s16_i1k.py#L48
  2. For all evaluators, in their config, set cache_final to False: https://github.com/google-research/big_vision/blob/c01707f710170baf162dc217db7f1f044fc5be9c/big_vision/evaluators/classification.py#L59
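Concretely, the two options above could look like this inside the config file. This is a sketch: the `cache_raw`/`cache_final` field names come from the lines linked above, but the evaluator layout under `config.evals` is an assumption:

```python
# Sketch of edits to big_vision/configs/vit_s16_i1k.py.

# Option 1: don't cache raw decoded examples in host RAM; stream from GCS instead.
config.cache_raw = False

# Option 2: disable caching of the final preprocessed split for every evaluator
# (assumes evaluators are entries under config.evals).
for eval_name in config.evals:
    config.evals[eval_name].cache_final = False
```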