vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by Pytorch Lightning
MIT License

Program doesn't run properly in some epochs with DALI #265

Closed nanhuayu closed 2 years ago

nanhuayu commented 2 years ago

Hi, I want to speed up training with DALI, but some problems occur when I run with DALI on the ImageNet-100 dataset. There should be 130000 / (4 × 128 bs) × 400 ep ≈ 101562 steps, and num_training_steps showed 253 steps, but the final summary reports only about half of the expected steps (50600). I checked the log info and found that the program does not run properly in the odd epochs. Are there any suggestions?

Epoch 0: 100%|██████████| 131/131 [01:01<00:00, 2.13it/s, loss=11.6, v_num=rdy0]
Epoch 0: 0%| | 0/131 [00:00<?, ?it/s, loss=11.6, v_num=rdy0]
Epoch 1: 0%| | 0/131 [00:00<?, ?it/s, loss=11.6, v_num=rdy0]
Epoch 1: 0%| | 0/131 [00:00<?, ?it/s, loss=11.6, v_num=rdy0]
Epoch 1: 0%| | 0/131 [00:00<?, ?it/s, loss=11.6, v_num=rdy0]
Epoch 2: 0%| | 0/131 [00:00<?, ?it/s, loss=11.6, v_num=rdy0]
Epoch 2: 1%| | 1/131 [00:00<00:49, 2.62it/s, loss=11.6, v_num=rdy0]
Epoch 2: 1%| | 1/131 [00:00<00:49, 2.61it/s, loss=11.6, v_num=rdy0]
Epoch 2: 2%|▏ | 2/131 [00:00<01:03, 2.04it/s, loss=11.6, v_num=rdy0]
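For reference, the arithmetic behind those numbers works out as follows (a sketch using the figures from the report; the per-GPU batch of 128 is taken from the formula above, and the last line shows why 50600 is consistent with every other epoch producing no steps):

```python
# Reproducing the step counts quoted in the report above.
images = 130_000               # ImageNet-100 training images
world_size = 4                 # --devices 0,1,2,3
per_gpu_batch = 128            # value used in the reporter's formula
epochs = 400                   # --max_epochs 400

steps_per_epoch = images // (world_size * per_gpu_batch)           # 253
expected_total = images / (world_size * per_gpu_batch) * epochs    # ~101562.5

print(steps_per_epoch)                 # 253, matching num_training_steps
print(int(expected_total))             # 101562 expected optimizer steps
print(steps_per_epoch * epochs // 2)   # 50600: half the epochs actually ran
```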

main_pretrain.py \
    --dataset imagenet100 \
    --backbone resnet18 \
    --data_dir datasets \
    --train_dir imagenet-100/train \
    --val_dir imagenet-100/val \
    --max_epochs 400 \
    --devices 0,1,2,3 \
    --accelerator gpu \
    --strategy ddp \
    --sync_batchnorm \
    --precision 16 \
    --optimizer lars \
    --grad_clip_lars \
    --eta_lars 0.02 \
    --exclude_bias_n_norm \
    --scheduler warmup_cosine \
    --lr 0.3 \
    --weight_decay 1e-4 \
    --batch_size 256 \
    --num_workers 4 \
    --brightness 0.8 \
    --contrast 0.8 \
    --saturation 0.8 \
    --hue 0.2 \
    --num_crops_per_aug 2 \
    --name simclr-400ep-imagenet100-resnet18-256 \
    --project solo \
    --entity nanhuayu \
    --dali \
    --wandb \
    --save_checkpoint \
    --method simclr \
    --temperature 0.2 \
    --proj_hidden_dim 2048

nanhuayu commented 2 years ago

[Epoch skipping behavior with PyTorch Lightning + DDP](https://github.com/NVIDIA/DALI/issues/3865): epochs terminating early incorrectly

> After some consideration, I think we can add `mode="iter"`, which would attempt to reset the DALI iterator whenever you call iter() on it. It should be more intuitive and aligned with other loaders' behavior.
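In other words, the proposed mode would make iter() implicitly reset the pipeline, the way an ordinary PyTorch DataLoader restarts each epoch. A minimal sketch of that behavior, assuming a generic DALI-style iterator with a reset() method (the wrapper class itself is illustrative, not DALI's actual API):

```python
class ResettingIterWrapper:
    """Illustrative wrapper: reset a DALI-style iterator on every iter().

    DALI iterators raise StopIteration at the end of an epoch and must be
    reset manually before they can be consumed again; plain PyTorch
    DataLoaders do this implicitly, which is the mismatch that leaves
    odd epochs yielding zero batches.
    """

    def __init__(self, dali_iterator):
        self.it = dali_iterator
        self._fresh = True  # the iterator starts out ready to consume

    def __iter__(self):
        if not self._fresh:
            self.it.reset()  # restart the pipeline for the new epoch
        self._fresh = False
        return self.it

    def __len__(self):
        return len(self.it)
```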

This seems to be a bug in PL. Is there any way to solve it, or do I have to stay on a PL version < 1.6.0? @vturrisi @DonkeyShot21

vturrisi commented 2 years ago

I'll try to check this next week.

vturrisi commented 2 years ago

@nanhuayu I'm investigating this to see if they fixed it in the latest version of DALI (the fix has already been merged, but I'm not sure it made it into the pip release). The cause of the learning rate problem seems to be this, together with the parameter changes that Lightning introduced in 1.6+.
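For context, the general shape of a workaround here is to reset the DALI iterator from a Lightning hook at epoch boundaries. The sketch below is purely illustrative and not necessarily what the eventual fix does; the callback name and the way the dataloader is reached are assumptions:

```python
import pytorch_lightning as pl


class DALIResetCallback(pl.Callback):
    """Hypothetical callback that resets a DALI iterator between epochs.

    How the trainer exposes its train dataloader differs across PL
    versions, so treat this as a sketch, not solo-learn's actual fix.
    """

    def on_train_epoch_end(self, trainer, pl_module):
        loader = trainer.train_dataloader
        # DALIGenericIterator exposes reset(); calling it lets the next
        # epoch start from a fresh pipeline instead of consuming an
        # already-exhausted iterator, which yields zero batches.
        if hasattr(loader, "reset"):
            loader.reset()
```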

vturrisi commented 2 years ago

The issue has been fixed in #269. NVIDIA DALI will properly fix this in 1.16, and we will remove the temporary workaround then.

vturrisi commented 1 year ago

@nanhuayu there are other issues with the new PL version, like some methods being deprecated. I plan on pushing a large update (no deadline at the moment) so that we can move to a newer PL version. I'll also pay close attention to how it interacts with DALI.