median-research-group / LibMTL

A PyTorch Library for Multi-Task Learning
MIT License

Identical result for CAGrad and MoCo #56

Closed: azj-n closed this issue 10 months ago

azj-n commented 11 months ago

Hello, and thank you for this library. When I check the training results for CAGrad and MoCo on the Office-31 dataset, they are exactly the same epoch by epoch with the default config values. Am I missing something?

Baijiong-Lin commented 11 months ago

What are your running commands? I have tested both methods and their results are different.

[two screenshots: CAGrad and MoCo results]

azj-n commented 11 months ago

[two screenshots (2023-10-11): CAGrad and MoCo results]

Not only are they identical, they are also different from your values. I ran:

python train_office.py --weighting CAGrad --arch HPS --dataset_path 'dataset' --gpu_id 0 --multi_input
python train_office.py --weighting MoCo --arch HPS --dataset_path 'dataset' --gpu_id 1 --multi_input

I checked the parameter updates for the first 5 epochs and they are also identical; I am not sure why I am getting these results. (The duplicated 0th epoch in the second picture is one I pasted in to compare the values.)

Baijiong-Lin commented 11 months ago

The running commands you used are correct. I have no idea what is causing your problem. Have you tried any other weighting methods?

azj-n commented 11 months ago

Yes, the other weighting methods seem to work fine.

azj-n commented 10 months ago

I have created a new setup in a separate environment, but with the latest versions of torch and torchvision, and it yielded the same results again. Could the torch version be causing the problem? I had to change torchvision.models.utils to torch.hub. I am also wondering about the seed: seed 0, for example, should give the same results for everyone, right?

Baijiong-Lin commented 10 months ago

I have cloned the latest LibMTL repo and rerun the experiments. The results are the same as those I provided before.

My env: torch==1.8.1+cu111, torchvision==0.9.1+cu111, RTX 3090 GPU

[two screenshots: rerun CAGrad and MoCo results]

azj-n commented 10 months ago

Hello, I used pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html to install those exact versions, and my GPU is an NVIDIA RTX A6000. I finally get different results for MoCo and CAGrad, but they are still slightly different from your results for seed 0. Are the results supposed to be exactly the same for everyone, or is it normal for them to differ slightly?

[two screenshots: MoCo and CAGrad results on the RTX A6000]

Baijiong-Lin commented 10 months ago

Does that mean this problem is caused by the torch version? That is surprising. Which versions of torch and torchvision did you use before? I will mark this as a bug.

As for the reproducibility problem, we have controlled the randomness according to https://pytorch.org/docs/stable/notes/randomness.html in the following code: https://github.com/median-research-group/LibMTL/blob/f10f7c9ffb72138a4ffae150330fb653da3b7456/LibMTL/utils.py#L9-L20
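For reference, a minimal sketch of the kind of seed control that snippet performs, following the linked PyTorch randomness notes (the actual function in LibMTL/utils.py may differ in details):

```python
import random

import numpy as np
import torch


def set_random_seed(seed: int) -> None:
    # Seed Python, NumPy, and PyTorch (CPU and every GPU) so a given seed is repeatable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in cuDNN, as the randomness notes recommend.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
```

Even so, the PyTorch randomness notes caution that bitwise-identical results are not guaranteed across different hardware or PyTorch releases, so small differences between, say, an RTX 3090 and an RTX A6000 are not unexpected.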

azj-n commented 10 months ago

Yes, it seems to be the problem. For the previous results I used torch==2.0.1 and torchvision==0.15.2, with which I had to change an import statement and a loader call (trainer.py, line 144: a.next() to next(a); resnet.py, line 3: torchvision.models.utils to torch.hub). The same changes were needed for the latest torch.
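For readers hitting the same issue on newer torch/torchvision, the two edits mentioned above look roughly like this (a sketch: the file and line references are those quoted in the comment, and the surrounding code here is illustrative rather than the actual LibMTL source):

```python
# resnet.py: torchvision.models.utils was removed in newer torchvision releases;
# load_state_dict_from_url now lives in torch.hub.
# old: from torchvision.models.utils import load_state_dict_from_url
from torch.hub import load_state_dict_from_url

# trainer.py: DataLoader iterators no longer have a .next() method; use the built-in next().
loader_iter = iter([("inputs", "targets")])  # stands in for iter(DataLoader(...))
# old: batch = loader_iter.next()
batch = next(loader_iter)
```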

Baijiong-Lin commented 10 months ago

Closed as there are no further updates.

Baijiong-Lin commented 8 months ago

@azj-n Hi, I guess this bug is caused by the different default value of set_to_none in zero_grad() across torch versions:

set_to_none=True in torch2 (https://pytorch.org/docs/2.0/generated/torch.optim.Optimizer.zero_grad.html?highlight=zero_grad)

while set_to_none=False in torch1.8.1 (https://pytorch.org/docs/1.8.1/optim.html?highlight=zero_grad#torch.optim.Optimizer.zero_grad)
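A quick way to see the difference, and to make the behaviour independent of the torch version, is to pass set_to_none explicitly. A small standalone sketch (not LibMTL code):

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(8, 4)).sum().backward()  # populate .grad
opt.zero_grad(set_to_none=False)           # torch 1.8.x default: grads become zero tensors
print(next(model.parameters()).grad)       # tensor of zeros

model(torch.randn(8, 4)).sum().backward()
opt.zero_grad(set_to_none=True)            # torch 2.x default: grads become None
print(next(model.parameters()).grad)       # None
```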

If the gradients have been set to None when https://github.com/median-research-group/LibMTL/blob/b1ff34d1bc72a208ef4f42301e6021db42913653/LibMTL/weighting/abstract_weighting.py#L35-L50 is called, then the gradient resetting for the backbone/encoder in https://github.com/median-research-group/LibMTL/blob/b1ff34d1bc72a208ef4f42301e6021db42913653/LibMTL/weighting/abstract_weighting.py#L62-L69 fails.

In other words, the parameters of the encoder are never updated; only the decoders are updated.
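A minimal sketch of that failure mode (illustrative only, not the actual abstract_weighting.py code): once zero_grad(set_to_none=True) has run, a guard of the form if param.grad is not None skips every shared parameter, so the merged gradient is never written back and the optimizer has nothing to apply to the encoder.

```python
import torch

encoder = torch.nn.Linear(4, 4)              # stands in for the shared encoder
encoder(torch.randn(2, 4)).sum().backward()  # per-task backward fills .grad
encoder.zero_grad(set_to_none=True)          # torch 2.x default: grads are now None

# The combined gradient that a method like CAGrad or MoCo would write back.
merged = torch.ones(sum(p.numel() for p in encoder.parameters()))

beg = 0
for param in encoder.parameters():
    end = beg + param.numel()
    if param.grad is not None:               # never true here, so the write-back is skipped
        param.grad.data = merged[beg:end].view_as(param).clone()
    beg = end

print(all(p.grad is None for p in encoder.parameters()))  # True: the encoder gets no gradient
```

Calling zero_grad(set_to_none=False) instead (or zeroing the gradients manually) leaves zero-filled gradient tensors in place, so the write-back succeeds and the encoder is updated as intended.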