Are you running on a single GPU with 12 GB of memory? I doubt you can fit an ImageNet run on that setup.
In our experiments the memory usage stabilizes after some time, so it's unlikely that this is due to a memory leak. More probably it is due to automatic mixed precision (some param groups might jump from precision 16 to precision 32) or to some internal behavior of DALI that I am not sure about.
EDIT: I see you are using batch size 48; maybe this is not the best choice. Instead, you can try decreasing the number of workers from 12 to maybe 4. This really reduces the amount of memory needed, with a negligible slowdown.
Yes, I am running it on a single GPU with 12 GB of memory, but as you correctly noted with batch size 48. I am aware that in terms of training results this is not an ideal setup. It is still good enough for debugging code before going on a multi-GPU cluster, I'd say. :)
I will try reducing the workers. I am aware that, in general, this is not a major issue; still, I thought it would be good to let you know about it.
Just to add to what @DonkeyShot21 said, DALI's memory usage scales with the number of workers. Every 4 workers per GPU add an overhead of ~3 GB after it stabilizes. I'm not really sure why memory increases after some epochs, because it should stay pretty much the same, since we pre-allocate a buffer here: https://github.com/vturrisi/solo-learn/blob/532e9a516b1253c86149a01812e81dfe2bd729df/solo/utils/dali_dataloader.py#L187
According to DALI docs (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html?highlight=host_memory_padding) this should be enough, but we always experienced a small increase in memory usage until ~epoch 60.
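For reference, here is a minimal sketch of where those decoder padding options go in a DALI pipeline. This is not our exact pipeline; the padding values, batch size, and augmentations below are just placeholders for illustration.

```python
# Minimal, illustrative DALI pipeline (NOT solo-learn's exact one) showing the
# decoder padding options discussed above; values here are placeholders.
from nvidia.dali import pipeline_def, fn


@pipeline_def(batch_size=48, num_threads=4, device_id=0)
def toy_train_pipeline(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    # host/device_memory_padding pre-allocate the decoder buffers so the decoder
    # does not keep re-allocating (and growing) memory during training.
    images = fn.decoders.image(
        jpegs,
        device="mixed",
        host_memory_padding=140544512,    # bytes, illustrative value
        device_memory_padding=211025920,  # bytes, illustrative value
    )
    images = fn.random_resized_crop(images, size=224)
    return images, labels
```

The actual buffer settings we use are the ones in the `dali_dataloader.py` line linked above.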
Ok, thanks, that's great info! So deactivating DALI should also work out, I guess? I have to admit that I didn't really take DALI into the equation when scaling up the workers.
Yes, if you turn DALI off you will save ~3 GB of memory (when using 4 workers), but you will run around 50% slower. If you scale the workers up a lot, I think you can get good performance, but you will use a lot of RAM.
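To illustrate the trade-off, the non-DALI path looks roughly like the sketch below (not our actual multi-crop loader, and the transforms are only an example). Here, extra workers mostly cost CPU RAM and CPU time rather than GPU memory.

```python
# Rough sketch of the non-DALI alternative (NOT solo-learn's actual loader,
# which builds its own multi-crop augmentation pipeline).
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.ToTensor(),
])

train_dataset = datasets.ImageFolder("/data/imagenet/train", transform=transform)
# More workers here mainly increase CPU RAM usage and CPU load, not GPU memory,
# but CPU-side decoding is roughly ~50% slower than the DALI pipeline.
train_loader = DataLoader(
    train_dataset,
    batch_size=48,
    num_workers=12,
    shuffle=True,
    pin_memory=True,
    drop_last=True,
)
```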
Hi,
I ran into an issue where the pretraining script crashes after 8.5 epochs due to an allocation failure. I am guessing there might be a memory leak somewhere.
Details:
```bash
python3 main_pretrain.py --dataset imagenet --encoder resnet50 \
    --data_dir /data --train_dir imagenet/train --val_dir imagenet/val \
    --max_epochs 100 --gpus 0 --distributed_backend ddp --sync_batchnorm \
    --precision 16 --optimizer sgd --scheduler warmup_cosine --lr 0.5 \
    --classifier_lr 0.1 --weight_decay 1e-5 --batch_size 48 --num_workers 12 \
    --brightness 0.4 --contrast 0.4 --saturation 0.4 --hue 0.1 \
    --zero_init_residual --name simsiam-resnet50-100ep-imagenet --dali \
    --entity tomsal --project solo-learn --wandb --method simsiam \
    --proj_hidden_dim 2048 --pred_hidden_dim 512 --output_dim 2048 \
    --amp_level O2 --log_gpu_memory all
```
Note that I disabled the `val_loader` for unrelated reasons (by setting `val_loader = None` just before line 159). No other changes were made. The error I get is the following:
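For clarity, my local change looks roughly like the sketch below; this is a paraphrase, and the actual code around line 159 of `main_pretrain.py` (and the Trainer arguments) looks different.

```python
# Hypothetical sketch of my local change in main_pretrain.py; the surrounding
# code is paraphrased, only the val_loader line was actually added.
from pytorch_lightning import Trainer


def fit_without_validation(model, train_loader):
    val_loader = None  # disable validation for unrelated reasons
    trainer = Trainer(gpus=[0], precision=16, max_epochs=100)
    # With no val dataloader provided, the validation loop is simply skipped.
    trainer.fit(model, train_loader, val_loader)
```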
After I ran into this the first time, I reran it with GPU memory logging. This is the plot I get:
I am a bit confused that there is an increase after 3.5k steps (from 11.979 GB to 12.01 GB). Let me know if I should provide more logs or anything else.
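In case it helps, if I understand correctly the `--log_gpu_memory all` flag above maps to Lightning's Trainer option of the same name (available in the older PyTorch Lightning versions used here), roughly like this sketch:

```python
# Rough equivalent of the --log_gpu_memory all flag in older PyTorch Lightning
# versions (the option was removed in later releases); values are illustrative.
from pytorch_lightning import Trainer

trainer = Trainer(
    gpus=[0],                # GPU index 0
    precision=16,
    log_gpu_memory="all",    # log memory stats of all GPUs at each logging step
)
```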
P.S.: Great work! It is a pleasure to work with! :)