vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by PyTorch Lightning
MIT License

Latest code gives a wrong tau in ImageNet training #286

Closed: trungpx closed this issue 2 years ago

trungpx commented 2 years ago

Hello,

When testing your latest code, ImageNet training with BYOL is about 10 minutes faster per epoch than with the version from one month ago.

However, it shows a wrong tau update, as in the figure below. With the defaults, tau should increase from 0.99 to 1 over the 100 ImageNet training epochs. The blue line (code from one month ago) is correct: it increases from 0.99 to 1 following a cosine schedule. The brown line is not: it jumps from 0.99 to 1 after only a few epochs. Note that for other datasets the tau update is correct. I found that the new code uses self._num_training_steps = dataset_size // effective_batch_size; is this the main issue?

I think this newly added code has an issue:

    if self.trainer.global_step > self.last_step:
        # update momentum backbone and projector
        momentum_pairs = self.momentum_pairs
        for mp in momentum_pairs:
            self.momentum_updater.update(*mp)
        # log tau momentum
        self.log("tau", self.momentum_updater.cur_tau)
        # update tau
        cur_step = self.trainer.global_step
        if self.trainer.accumulate_grad_batches:
            cur_step = cur_step * self.trainer.accumulate_grad_batches
        self.momentum_updater.update_tau(
            cur_step=cur_step,
            max_steps=self.max_epochs * self.num_training_steps,
        )

Compared to the previous code:

    if self.trainer.global_step > self.last_step:
        # update momentum backbone and projector
        momentum_pairs = self.momentum_pairs
        for mp in momentum_pairs:
            self.momentum_updater.update(*mp)
        # log tau momentum
        self.log("tau", self.momentum_updater.cur_tau)
        # update tau
        cur_step = self.trainer.global_step
        if self.trainer.accumulate_grad_batches:
            cur_step = cur_step * self.trainer.accumulate_grad_batches
        self.momentum_updater.update_tau(
            cur_step=cur_step,
            max_steps=len(self.trainer.train_dataloader) * self.trainer.max_epochs,
        )
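
For context, BYOL-style momentum updaters anneal tau with a cosine schedule of the form tau = final_tau - (final_tau - base_tau) * (cos(pi * cur_step / max_steps) + 1) / 2, so the value passed as max_steps directly controls how fast tau rises. Below is a minimal sketch of such a schedule (my own re-implementation for illustration, not solo-learn's actual MomentumUpdater; the clamp is an addition for the sketch). It shows that when max_steps is underestimated, tau saturates at final_tau after only a small fraction of training.

    import math

    def cosine_tau(cur_step: int, max_steps: int,
                   base_tau: float = 0.99, final_tau: float = 1.0) -> float:
        """Cosine schedule for the EMA coefficient tau (illustrative sketch).

        tau starts at base_tau and reaches final_tau at cur_step == max_steps.
        If max_steps is underestimated, cur_step / max_steps grows too fast and
        tau saturates long before training ends (without the clamp below it
        would then oscillate, as seen in issue #248).
        """
        ratio = min(cur_step / max_steps, 1.0)  # clamp added for this sketch
        return final_tau - (final_tau - base_tau) * (math.cos(math.pi * ratio) + 1) / 2

    # With max_steps 16x too small, tau hits 1.0 after ~1/16 of training:
    for step in (0, 10_000, 31_250, 100_000, 500_000):
        print(step, cosine_tau(step, 500_000), cosine_tau(step, 500_000 // 16))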

Figure 1. Incorrect tau update when training BYOL on ImageNet (a large dataset) with the new solo-learn code. The blue line (code from one month ago) is correct: it increases from 0.99 to 1 following a cosine schedule. The brown line is not: it jumps from 0.99 to 1 after only a few epochs.

Here is the bash file I ran for BYOL.

    python3 ../../../main_pretrain.py \
        --dataset imagenet \
        --backbone resnet50 \
        --data_dir ~/workspace/datasets/ \
        --train_dir imagenet/train \
        --val_dir imagenet/val \
        --max_epochs 100 \
        --accelerator gpu \
        --strategy ddp \
        --sync_batchnorm \
        --precision 16 \
        --optimizer sgd \
        --lars \
        --eta_lars 0.001 \
        --exclude_bias_n_norm \
        --scheduler warmup_cosine \
        --lr 0.45 \
        --classifier_lr 0.2 \
        --accumulate_grad_batches 16 \
        --weight_decay 1e-6 \
        --batch_size 128 \
        --num_workers 4 \
        --brightness 0.4 \
        --contrast 0.4 \
        --saturation 0.2 \
        --hue 0.1 \
        --gaussian_prob 1.0 0.1 \
        --solarization_prob 0.0 0.2 \
        --num_crops_per_aug 1 1 \
        --name byol_res50_2GPUs \
        --project Imagenet1K \
        --entity xxx \
        --wandb \
        --save_checkpoint \
        --method byol \
        --proj_output_dim 256 \
        --proj_hidden_dim 4096 \
        --pred_hidden_dim 4096 \
        --base_tau_momentum 0.99 \
        --final_tau_momentum 1.0 \
        --keep_previous_checkpoints \
        --checkpoint_frequency 1 \
        --dali \
        --devices 0,1
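
A rough back-of-envelope check with these numbers (my own arithmetic; it assumes the quoted num_training_steps = dataset_size // effective_batch_size counts optimizer steps, while the cur_step passed to update_tau is scaled back up by accumulate_grad_batches as in the snippet above) is consistent with tau saturating after only a few epochs:

    # Illustrative arithmetic only; the step-counting behaviour is an assumption
    # based on the quoted snippets, not verified against the actual code.
    dataset_size = 1_281_167                  # ImageNet-1k training images
    batch_size, gpus, accum, max_epochs = 128, 2, 16, 100

    effective_batch_size = batch_size * gpus * accum            # 4096
    num_training_steps = dataset_size // effective_batch_size   # 312 optimizer steps per epoch
    max_steps = max_epochs * num_training_steps                 # 31_200

    # cur_step = global_step * accumulate_grad_batches grows by roughly
    # num_training_steps * accum (~5_000) per epoch, so tau saturates after:
    epochs_until_tau_saturates = max_steps / (num_training_steps * accum)
    print(epochs_until_tau_saturates)  # -> 6.25, i.e. max_epochs / accum

Under those assumptions tau would reach 1.0 after roughly 6 of the 100 epochs, which matches the brown curve in Figure 1.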

Could you have a look? Thanks

DonkeyShot21 commented 2 years ago

Hi! I thought I had already fixed the issue some weeks ago, but apparently, it keeps reappearing 🤔 I'll take a look and let you know!

trungpx commented 2 years ago

Hello, thanks for the quick reply. I believe this is a different issue from the one I reported before in https://github.com/vturrisi/solo-learn/issues/248. That time, tau looked like a sine wave, as shown below. The new issue is different and only occurs for ImageNet pretraining.

[image: tau curve from issue #248, oscillating like a sine wave]

DonkeyShot21 commented 2 years ago

I have just launched an experiment on ImageNet, let's see what happens. Thanks for reporting it anyway!

trungpx commented 2 years ago

Nice, please give me an update then. Thanks!

vturrisi commented 2 years ago

@trungpx this should be fixed in #287. One issue I noticed is that we were not counting the last update step when gradient accumulation was on (relying on the new self.trainer.estimated_stepping_batches fixes this).
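
For reference, the idea behind the fix is to let the Trainer estimate the total number of optimizer steps instead of recomputing them from the dataset size. Below is a minimal sketch of that pattern, assuming PyTorch Lightning >= 1.6 (where Trainer.estimated_stepping_batches is available); it mirrors the hook quoted earlier but is not a verbatim copy of the patch in #287.

    # Sketch only: same structure as the hook quoted above, but max_steps now
    # comes from the trainer, which already accounts for gradient accumulation,
    # the number of devices, and max_epochs.
    def on_train_batch_end(self, outputs, batch, batch_idx):
        if self.trainer.global_step > self.last_step:
            # update momentum backbone and projector
            for mp in self.momentum_pairs:
                self.momentum_updater.update(*mp)
            # log tau momentum
            self.log("tau", self.momentum_updater.cur_tau)
            # update tau
            self.momentum_updater.update_tau(
                cur_step=self.trainer.global_step,
                max_steps=self.trainer.estimated_stepping_batches,
            )
        self.last_step = self.trainer.global_step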

trungpx commented 2 years ago

Thanks for the notice. I will give it a try and see if any issue remains. I assume the mentioned fixes have already been pushed to the GitHub code.

DonkeyShot21 commented 2 years ago

Hi! My ImageNet experiment is not over yet, but pretty much everything seems to be working correctly: tau goes up as expected, even without the new fix. @trungpx can you please pull again and rerun?

trungpx commented 2 years ago

Hi, then let me give it a try.

trungpx commented 2 years ago

It showed the proper tau in my experiment, so I think it is correct now. If any issue appears once training completes, I will let you know. Thanks so much.

[image: tau curve from the rerun]