Hi @davidrzs. Thanks for letting us know.
Can you try running the previous version of the repo, with an older pytorch lightning version, etc.? That should reproduce the numbers in the results table.
The much lower accuracy could be due to a pytorch lightning bug or version change, or it could have been introduced by one of the changes I made in the latest version. The only thing that comes to mind is that the LARS implementation I introduced to the repo is messed up.
400 epochs is more than enough; even with 100 epochs you should get similar performance, around 5% lower at most if I remember correctly.
The curve you posted is really smooth; usually you would see a bit more variance. It's very possibly an optimization problem, as Victor suggested. Can you send the loss as well?
I said LARS, but actually we didn't change it; we only changed the LR scheduler. I don't think that introduced any bug, but can you also share the LR plot?
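For context, the kind of warmup + cosine schedule being discussed can be sketched with stock PyTorch schedulers (a generic illustration only, not the repo's actual implementation; the model and hyperparameters are made up):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(8, 8)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.3)

warmup_epochs, max_epochs = 10, 400
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),  # linear warmup
        CosineAnnealingLR(optimizer, T_max=max_epochs - warmup_epochs),     # cosine decay
    ],
    milestones=[warmup_epochs],
)

for epoch in range(max_epochs):
    # (optimizer.step() calls per batch would go here)
    scheduler.step()  # stepped once per epoch in this sketch
```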
There are two other things that come to mind. First, DALI could have changed something; can you also try without it? Second, training now uses 16-mixed precision instead of the old 16 precision. I noticed that on a notebook GPU (1660 Ti) it simply doesn't work and the model outputs are just NaNs, without any error or warning. Can you try full-precision training (this worked for me)?
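In case it helps, with recent Lightning the precision options are spelled like this (a minimal sketch; all trainer arguments other than precision are omitted):

```python
from pytorch_lightning import Trainer

# Lightning >= 2.0 spells automatic mixed precision as "16-mixed";
# older releases used precision=16.
trainer_amp = Trainer(max_epochs=400, precision="16-mixed")

# Full-precision fallback, useful to rule out AMP-related NaNs.
trainer_fp32 = Trainer(max_epochs=400, precision=32)
```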
Here are some screenshots:
I can schedule a 32-bit run later tonight or tomorrow with DALI disabled (the run with the old packages will take some time, as getting those virtual environments to work on the cluster is a real pain due to memory restrictions).
Looking at the learning rate plot, something seems a bit odd there: we do not see the full progression that we get with other datasets such as the CIFAR ones.
Let me know if any other screenshots or data would be helpful.
How long did training take you? This run took 17h 32m 40s on 2 Titan Xp cards.
@davidrzs Thanks for all the screenshots. Is this the full training? If so, there's definitely an issue with the LR scheduler, and most likely that's the cause of the low performance. It could be that lightning is now calling the scheduler in a different way, I don't know.
Yes, this is the full 400 epochs (i.e. the run exited completely).
The following are the first lines of the log:
wandb: logging graph, to disable use `wandb.watch(log_graph=False)`
/home/zdavid/.local/share/virtualenvs/ssl_pm-B_gN3PY8/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /itet-stor/zdavid/net_scratch/ssl_pm/main_pretrain.p ...
rank_zero_warn(
/home/zdavid/.local/share/virtualenvs/ssl_pm-B_gN3PY8/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /itet-stor/zdavid/net_scratch/ssl_pm/main_pretrain.p ...
rank_zero_warn(
Epoch 0: 100%|██████████| 494/494 [05:03<00:00, 1.63it/s, v_num=rv9c]
/home/zdavid/.local/share/virtualenvs/ssl_pm-B_gN3PY8/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Epoch 2: 100%|██████████| 494/494 [04:58<00:00, 1.66it/s, v_num=rv9c]
Epoch 6: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Validation DataLoader 0: 20%|██ | 4/20 [00:00<00:00, 80.20it/s]
Epoch 8: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Epoch 10: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Epoch 12: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 14: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 16: 100%|██████████| 494/494 [04:58<00:00, 1.66it/s, v_num=rv9c]
Epoch 18: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Epoch 20: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Epoch 22: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Epoch 24: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 26: 100%|██████████| 494/494 [05:06<00:00, 1.61it/s, v_num=rv9c]
Epoch 30: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
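As an aside, the UserWarning about `lr_scheduler.step()` in the excerpt above describes the call order PyTorch expects; schematically (a generic sketch of that pattern, not this repo's training loop):

```python
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400)

for epoch in range(400):
    for step in range(494):
        optimizer.zero_grad()
        loss = model(torch.randn(4, 8)).sum()  # dummy loss for illustration
        loss.backward()
        optimizer.step()   # optimizer first...
    scheduler.step()       # ...then the scheduler, else the first LR value is skipped
```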
What is also interesting is that 494 batches are listed; with a batch size of 128 that gives 63'232 images, i.e. half the training set. It only makes sense if 128 is the per-GPU batch size, giving an effective batch size of 256 and 126'464 images per epoch, just below the 130'000 images we actually have (1'300 per class, I believe; see the quick check below). Hence, just as a side question: is the batch size indicated in the yaml file the total batch size or the per-device batch size?
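The arithmetic behind that, as a quick sanity check (numbers taken from the log above; the per-GPU interpretation is the assumption being tested):

```python
per_gpu_batch = 128      # batch_size from the yaml, assumed to be per device
num_gpus = 2
steps_per_epoch = 494    # from the progress bars above

effective_batch = per_gpu_batch * num_gpus   # 256
print(steps_per_epoch * effective_batch)     # 126464 images per epoch
print(100 * 1300)                            # 130000 images at ~1300 per class
```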
These are the last rows from the log file:
Validation DataLoader 0: 20%|██ | 4/20 [00:00<00:00, 30.50it/s]
Epoch 386: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 390: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Validation DataLoader 0: 20%|██ | 4/20 [00:00<00:00, 36.57it/s]
Epoch 392: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 394: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 396: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 398: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 399: 0%| | 0/494 [00:01<?, ?it/s, v_num=rv9c]
(I assume the last line just got truncated.)
The batch size in the yaml file is per GPU, so that's fine. But this epoch skipping is very strange... Can you train using a single GPU, even for just 20 epochs in total? Just to see whether there's any epoch skipping or issue with the LR scheduler.
@DonkeyShot21 and I were talking, and this could be an issue that happened quite a while ago with DALI + pytorch lightning. This is the issue: https://github.com/NVIDIA/DALI/issues/3865. I'll need some time to check it again, but it would be great if you could report which setups work on your end.
I already tested a single GPU with DALI, same behaviour (here both the single-GPU and the earlier multi-GPU runs are plotted):
Will try a non-DALI run and report back.
This is from a non-DALI run on two Titan Xp cards:
In the images you can see the previous DALI run and the new, still-running non-DALI run:
Am I interpreting the epoch chart correctly that the DALI run somehow advances twice as many epochs per global trainer step? (That would be weird.)
If the issue is the same as before, DALI calls a reset method after each epoch, which then triggers an "epoch skip" in lightning. This very much looks to be the case. I'll check what the proper fix is in the next couple of days.
I let it run overnight (still running, will let it complete) and we are now at around epoch 160:
The following are the graphs:
I feel like the accuracy is still lower than it should be, given that almost half of the training is done?
@davidrzs I checked my old runs, and yours does seem a bit lower than the run I have (not hugely significant, but we should wait until the run is finished).
@davidrzs I've just confirmed the issue on my end. For now, use pytorch lightning==1.9.0. I'm pushing a small temporary update to pin the version to 1.9.0 because I don't want the repo to stay on an unstable version. Can you try to re-run your experiment with the patch?
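In the meantime, pinning the version manually is just:

```
pip install "pytorch-lightning==1.9.0"
```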
I have just scheduled the same model on the cluster with DALI enabled (will report in due time).
The last run (without DALI) terminated after 380 epochs (we have a 48h limit on the cluster); here are the plots:
Top1 train seems to be okay, but top1 val seems way too low. Do you have any idea why? I will investigate a bit on Wednesday.
Hummm. It could be the change I added to comply with lightning>=2.0. I'll check that soon (if that's the case, it's only an evaluation issue).
Btw, the root cause of the epoch-skipping behavior is dropping the last batch. I've already reported that in another issue and hopefully it gets fixed soon.
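A toy illustration of the off-by-one that dropping the last batch introduces (my own sketch, not the repo's code): with ~130'000 images and an effective batch of 256, the two settings disagree on the number of batches per epoch, which is the kind of mismatch that can desynchronize DALI's epoch accounting from lightning's.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.zeros(130_000, 1))  # stand-in for ~130k training images
print(len(DataLoader(ds, batch_size=256, drop_last=False)))  # 508 (last batch partial)
print(len(DataLoader(ds, batch_size=256, drop_last=True)))   # 507 (last batch dropped)
```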
I can confirm that it works with DALI on pytorch lightning==1.9.0 (2 GPUs).
See screenshots:
However, the evaluation issue persists. Do you have any idea what could be causing it? That way I could also start investigating.
@davidrzs I'll push an update that will fix DALI for lightning>2.0. About the accuracy issue: lightning changed how you compute epoch-wise metrics. Before, you just returned them from the validation_step and received them as a parameter in another callback. Now, you need to store the metrics in a list and remember to clear it manually. I simply forgot to call .clear() on it. The push will also fix this. I'll just wait for the tests to finish and merge it into the main branch. Can you re-run your code? You should also be able to run with DALI and lightning==2.0.2.
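Concretely, the migration looks roughly like this (a simplified sketch of the pattern; names like val_outputs are my own, not the repo's):

```python
import torch
from pytorch_lightning import LightningModule

class OldStyle(LightningModule):
    # lightning < 2.0: step outputs were collected and handed back to you
    def validation_step(self, batch, batch_idx):
        acc = torch.tensor(0.5)  # placeholder metric
        return {"val_acc": acc}

    def validation_epoch_end(self, outputs):
        self.log("val_acc", torch.stack([o["val_acc"] for o in outputs]).mean())

class NewStyle(LightningModule):
    # lightning >= 2.0: you accumulate outputs yourself
    def __init__(self):
        super().__init__()
        self.val_outputs = []

    def validation_step(self, batch, batch_idx):
        acc = torch.tensor(0.5)  # placeholder metric
        self.val_outputs.append(acc)

    def on_validation_epoch_end(self):
        self.log("val_acc", torch.stack(self.val_outputs).mean())
        self.val_outputs.clear()  # forgetting this mixes metrics across epochs
```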
Thank you very much for your work on this! Will rerun it after you merged with main and report tomorrow.
Merged it. The modification is really simple, so hopefully it's fine.
Can confirm that it works (learning rate as well as validation accuracies). Thanks for fixing it!
Though, you might want to update pytorch lightning in the requirements file. (I manually updated it and ran on 2.0.2.)
Super! It's always annoying to update library versions, but I'm glad it's working now :). I'll also update the requirements file; I thought I already had.
Describe the bug
This might not be a bug, but your guidance would be valued:
I tried to reproduce the imagenet100 results using Barlow Twins and got accuracies well below the officially reported figures. The model trained for 400 epochs with the default configuration on 2 GPUs, but the same accuracies are observed even when training on just 1 GPU. I suspected that 400 epochs might not be enough (as the validation curve keeps increasing), hence I wanted to check whether you really trained the models for 400 epochs.
I suspect I am doing something wrong, but I am unsure what my mistake is. Thanks for your guidance.
For the dataset: We used these classes as referenced here.
To Reproduce
python /ssl_pm/main_pretrain.py --config-path scripts/pretrain/imagenet-100/ --config-name barlow.yaml
Screenshots
Versions
Latest version (pulled 2 days ago)
Here is my barlow.yaml file for reference: