Hi @davidrzs. Thanks for letting us know.
Can you try running the previous version of the repo, with an older pytorch lightning version, etc.? That should reproduce the numbers in the results table.
The much lower accuracy could be due to a pytorch lightning bug or version change, or it could have been introduced by one of the changes I made in the latest version. The only thing that comes to mind is that the LARS implementation I introduced to the repo is messed up.
400 epochs is more than enough; even with 100 epochs you should get similar performance, around 5% lower at most if I remember correctly.
The curve you posted is really smooth; usually you would see a bit more variance. It's very possibly an optimization problem, as Victor suggested. Can you send the loss as well?
I said LARS, but actually we didn't change it; we only changed the LR scheduler. I don't think that introduced any bug, but can you also share the LR plot?
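For context, the kind of warmup + cosine schedule being discussed can be sketched with stock PyTorch schedulers (a generic illustration only, not the repo's actual implementation; the model and hyperparameters are made up):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(8, 8)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.3)

warmup_epochs, max_epochs = 10, 400
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),  # linear warmup
        CosineAnnealingLR(optimizer, T_max=max_epochs - warmup_epochs),     # cosine decay
    ],
    milestones=[warmup_epochs],
)

for epoch in range(max_epochs):
    # (optimizer.step() calls per batch would go here)
    scheduler.step()  # stepped once per epoch in this sketch
```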
There are two other things that come to mind. First, DALI could have changed something; can you also try without it? Second, training now uses 16-mixed precision instead of the old 16 precision. I noticed that on a notebook GPU (1660 Ti) it simply doesn't work and the model outputs are just NaNs, without any error or warning. Can you try full-precision training (this worked for me)?
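In case it helps, with recent Lightning the precision options are spelled like this (a minimal sketch; all trainer arguments other than precision are omitted):

```python
from pytorch_lightning import Trainer

# Lightning >= 2.0 spells automatic mixed precision as "16-mixed";
# older releases used precision=16.
trainer_amp = Trainer(max_epochs=400, precision="16-mixed")

# Full-precision fallback, useful to rule out AMP-related NaNs.
trainer_fp32 = Trainer(max_epochs=400, precision=32)
```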
Here are some screenshots:
I can schedule a 32-bit run later tonight or tomorrow with DALI disabled (the run with the old packages will take some time, as getting those virtual environments to work on the cluster is a real pain due to memory restrictions).
Looking at the learning rate plot, something seems a bit odd there: we do not see the full progression that we get with other datasets such as the CIFAR ones.
Let me know if any other screenshots or data would be helpful.
How long did training take you? This run took 17h 32m 40s on 2 Titan Xp cards.
@davidrzs Thanks for all the screenshots. Is this the full training? If so, there's definitely an issue with the LR scheduler, and most likely that's the cause of the low performance. It could be that lightning is now calling the scheduler in a different way, I don't know.
Yes, this is the full 400 epochs (i.e. the run exited completely).
The following are the first lines of the log:
wandb: logging graph, to disable use `wandb.watch(log_graph=False)`
/home/zdavid/.local/share/virtualenvs/ssl_pm-B_gN3PY8/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /itet-stor/zdavid/net_scratch/ssl_pm/main_pretrain.p ...
rank_zero_warn(
/home/zdavid/.local/share/virtualenvs/ssl_pm-B_gN3PY8/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /itet-stor/zdavid/net_scratch/ssl_pm/main_pretrain.p ...
rank_zero_warn(
Epoch 0: 100%|██████████| 494/494 [05:03<00:00, 1.63it/s, v_num=rv9c]
/home/zdavid/.local/share/virtualenvs/ssl_pm-B_gN3PY8/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Epoch 2: 100%|██████████| 494/494 [04:58<00:00, 1.66it/s, v_num=rv9c]
Epoch 6: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Validation DataLoader 0: 20%|██ | 4/20 [00:00<00:00, 80.20it/s]
Epoch 8: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Epoch 10: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Epoch 12: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 14: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 16: 100%|██████████| 494/494 [04:58<00:00, 1.66it/s, v_num=rv9c]
Epoch 18: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Epoch 20: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Epoch 22: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
Epoch 24: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 26: 100%|██████████| 494/494 [05:06<00:00, 1.61it/s, v_num=rv9c]
Epoch 30: 100%|██████████| 494/494 [04:58<00:00, 1.65it/s, v_num=rv9c]
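As an aside, the UserWarning about `lr_scheduler.step()` in the excerpt above describes the call order PyTorch expects; schematically (a generic sketch of that pattern, not this repo's training loop):

```python
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400)

for epoch in range(400):
    for step in range(494):
        optimizer.zero_grad()
        loss = model(torch.randn(4, 8)).sum()  # dummy loss for illustration
        loss.backward()
        optimizer.step()   # optimizer first...
    scheduler.step()       # ...then the scheduler, else the first LR value is skipped
```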
What is also interesting is that 494 batches are listed; with a batch size of 128 that gives 63'232 images, i.e. half the training set. It only makes sense if 128 is the per-GPU batch size, giving an effective batch size of 256 and 126'464 images per epoch, just below the 130'000 images we actually have (1'300 per class, I believe; see the quick check below). Hence, just as a side question: is the batch size indicated in the yaml file the total batch size or the per-device batch size?
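The arithmetic behind that, as a quick sanity check (numbers taken from the log above; the per-GPU interpretation is the assumption being tested):

```python
per_gpu_batch = 128      # batch_size from the yaml, assumed to be per device
num_gpus = 2
steps_per_epoch = 494    # from the progress bars above

effective_batch = per_gpu_batch * num_gpus   # 256
print(steps_per_epoch * effective_batch)     # 126464 images per epoch
print(100 * 1300)                            # 130000 images at ~1300 per class
```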
These are the last rows from the log file:
Validation DataLoader 0: 20%|██ | 4/20 [00:00<00:00, 30.50it/s]
Epoch 386: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 390: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Validation DataLoader 0: 20%|██ | 4/20 [00:00<00:00, 36.57it/s]
Epoch 392: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 394: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 396: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 398: 100%|██████████| 494/494 [04:59<00:00, 1.65it/s, v_num=rv9c]
Epoch 399: 0%| | 0/494 [00:01<?, ?it/s, v_num=rv9c]
(I assume the last line just got truncated.)
The batch size in the yaml file is per GPU, so that's fine. But this epoch skipping is very strange... Can you train using a single GPU, even for just 20 epochs in total? Just to see whether there's any epoch skipping or issue with the LR scheduler.
@DonkeyShot21 and I were talking, and this could be an issue that happened quite a while ago with DALI + pytorch lightning. This is the issue: https://github.com/NVIDIA/DALI/issues/3865. I'll need some time to check it again, but it would be great if you could report which setups work on your end.
I already tested a single GPU with DALI, same behaviour (here both the single-GPU and the earlier multi-GPU runs are plotted):
Will try a non-DALI run and report back.
This is from a non-DALI run on two Titan Xp cards:
In the images you can see the previous DALI run and the new, still-running non-DALI run:
Am I interpreting the epoch chart correctly that the DALI run somehow advances twice as many epochs per global trainer step? (That would be weird.)
If the issue is the same as before, DALI calls a reset method after each epoch, which then triggers an "epoch skip" in lightning. This very much looks to be the case. I'll check what the proper fix is in the next couple of days.
I let it run overnight (still running, will let it complete) and we are now at around epoch 160:
The following are the graphs:
I feel like the accuracy is still lower than it should be, given that almost half of the training is done?
@davidrzs I checked my old runs, and yours does seem a bit lower than the run I have (not hugely significant, but we should wait until the run is finished).
@davidrzs I've just confirmed the issue on my end. For now, use pytorch lightning==1.9.0. I'm pushing a small temporary update to pin the version to 1.9.0 because I don't want the repo to stay on an unstable version. Can you try to re-run your experiment with the patch?
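In the meantime, pinning the version manually is just:

```
pip install "pytorch-lightning==1.9.0"
```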
I have just scheduled the same model on the cluster with DALI enabled (will report in due time).
The last run (without DALI) terminated after 380 epochs (we have a 48h limit on the cluster); here are the plots:
Top1 train seems to be okay, but top1 val seems way too low. Do you have any idea why? I will investigate a bit on Wednesday.
Hummm. It could be the change I added to comply with lightning>=2.0. I'll check that soon (if that's the case, it's only an evaluation issue).
Btw, the root cause of the epoch-skipping behavior is dropping the last batch. I've already reported that in another issue and hopefully it gets fixed soon.
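A toy illustration of the off-by-one that dropping the last batch introduces (my own sketch, not the repo's code): with ~130'000 images and an effective batch of 256, the two settings disagree on the number of batches per epoch, which is the kind of mismatch that can desynchronize DALI's epoch accounting from lightning's.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.zeros(130_000, 1))  # stand-in for ~130k training images
print(len(DataLoader(ds, batch_size=256, drop_last=False)))  # 508 (last batch partial)
print(len(DataLoader(ds, batch_size=256, drop_last=True)))   # 507 (last batch dropped)
```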
I can confirm that it works with DALI on pytorch lightning==1.9.0 (2 GPUs).
See screenshots:
However, the evaluation issue persists. Do you have any idea what could be causing it? That way I could also start investigating.
@davidrzs I'll push an update that will fix DALI for lightning>2.0. About the accuracy issue: lightning changed how you compute epoch-wise metrics. Before, you just returned them from the validation_step and received them as a parameter in another callback. Now, you need to store the metrics in a list and remember to clear it manually. I simply forgot to call .clear() on it. The push will also fix this. I'll just wait for the tests to finish and merge it into the main branch. Can you re-run your code? You should also be able to run with DALI and lightning==2.0.2.
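Concretely, the migration looks roughly like this (a simplified sketch of the pattern; names like val_outputs are my own, not the repo's):

```python
import torch
from pytorch_lightning import LightningModule

class OldStyle(LightningModule):
    # lightning < 2.0: step outputs were collected and handed back to you
    def validation_step(self, batch, batch_idx):
        acc = torch.tensor(0.5)  # placeholder metric
        return {"val_acc": acc}

    def validation_epoch_end(self, outputs):
        self.log("val_acc", torch.stack([o["val_acc"] for o in outputs]).mean())

class NewStyle(LightningModule):
    # lightning >= 2.0: you accumulate outputs yourself
    def __init__(self):
        super().__init__()
        self.val_outputs = []

    def validation_step(self, batch, batch_idx):
        acc = torch.tensor(0.5)  # placeholder metric
        self.val_outputs.append(acc)

    def on_validation_epoch_end(self):
        self.log("val_acc", torch.stack(self.val_outputs).mean())
        self.val_outputs.clear()  # forgetting this mixes metrics across epochs
```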
Thank you very much for your work on this! Will rerun it after you merged with main and report tomorrow.
Merged it. The modification is really simple, so hopefully it's fine.
Can confirm that it works (learning rate as well as validation accuracies). Thanks for fixing it!
Though, you might want to update pytorch lightning in the requirements file. (I manually updated it and ran on 2.0.2.)
Super! It's always annoying to update library versions, but I'm glad it's working now :). I'll also update the requirements file; I thought I already had.
Describe the bug
This might not be a bug, but your guidance would be valued:
I tried to reproduce the imagenet100 results using Barlow Twins and got accuracies well below the officially reported figures. The model trained for 400 epochs with the default configuration on 2 GPUs, but the same accuracies are observed even when training on just 1 GPU. I suspected that 400 epochs might not be enough (as the validation curve keeps increasing), hence I wanted to check whether you really trained the models for 400 epochs.
I suspect I am doing something wrong, but I am unsure what my mistake is. Thanks for your guidance.
For the dataset: We used these classes as referenced here.
To Reproduce
python /ssl_pm/main_pretrain.py --config-path scripts/pretrain/imagenet-100/ --config-name barlow.yaml
Screenshots
Versions
Latest version (pulled 2 days ago)
Here is my barlow.yaml file for reference: