Xiatian-Zhu opened this issue 3 years ago
It is the last epoch model.
Interesting. What I got is only 65.316% (1 trial), and I do not think I changed anything. Can you try to repeat the result at your end, @taoyang1122? Thanks.
I can repeat the results. Can you report your setting and environment? Also, are you using my provided pre-trained model for linear evaluation, or do you first do unsupervised pretraining and then linear evaluation? If the latter, you could try running linear evaluation directly with my provided model to see which stage went wrong.
Thanks for the response. I did both SimSiam pretraining and linear classifier training on my side. Indeed, it is a good idea to use your pretrained model. I will try that.
Below is the config I used. I use the DP version instead of the DDP version, which I failed to get working on my machine. Not sure if this is the problem.
```
python3 main_simsiam_DP.py \
  --aug-plus \
  --cos \
  -a resnet50 \
  --lr 0.1 \
  -p 100 \
  --epochs 100 \
  --batch-size 512 \
  # --dist-url 'tcp://localhost:10001' \
  # --multiprocessing-distributed \
  # --world-size 1 \
  # --rank 0 \
```
For linear classification training, I used the same config as yours (again, DDP options commented out):
```
python3 main_lincls.py \
  -a resnet50 \
  --lr 1.6 \
  --cos \
  --epochs 90 \
  --batch-size 4096 \
  -p 100 \
  --pretrained /gpfs-volume/train_logs/simsiam/simsiam_checkpoint_0099.pth.tar \
  # --dist-url 'tcp://localhost:10001' \
  # --multiprocessing-distributed \
  # --world-size 1 \
  # --rank 0 \
```
Sorry, although I implemented the DP version, I didn't really test its performance. The paper says it uses syncBN, so DP may cause some issues.
I will remove the DP version; sorry for the confusion. You can try to get DDP working, and I think that should reproduce the results.
I see. Thanks for letting me know; it is good to know the reason. I will check the DDP version again.
The DDP version is still a problem on my machine; I often get the error below. @taoyang1122, have you seen this issue before? Or are there special requirements on the CUDA driver and package versions? Thanks!
```
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/user/SimSiam_ImageNet/mlp_main_simsiam_ddp.py", line 300, in main_worker
    train(train_loader, model, optimizer, epoch, args)
  File "/home/user/SimSiam_ImageNet/mlp_main_simsiam_ddp.py", line 340, in train
    z1, p1 = model(images[0])
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/SimSiam_ImageNet/models/simsiam.py", line 77, in forward
    z = self.projection(x)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/SimSiam_ImageNet/models/simsiam.py", line 27, in forward
    x = self.l1(x)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 429, in forward
    self._check_input_dim(input)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 417, in _check_input_dim
    .format(input.dim()))
ValueError: expected at least 3D input (got 2D input)
```
No, I don't have this issue. It seems there is some problem with the input dimension; maybe you can print the input dimensions to the encoder and to the projection MLP to see if they are reasonable.
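One low-effort way to do that check, sketched below with a stand-in `Linear` layer (the actual module names in the repo may differ), is a forward pre-hook that prints whatever shape enters a module, without editing the training loop:

```python
import torch
import torch.nn as nn

def print_input_shape(module, inputs):
    # Forward pre-hook: fires before module.forward, reporting the input shape.
    print(type(module).__name__, "input shape:", tuple(inputs[0].shape))

# Stand-in for the projection MLP; in the real code this could be attached
# via something like model.projection.register_forward_pre_hook(...).
proj = nn.Linear(2048, 512)
proj.register_forward_pre_hook(print_input_shape)
_ = proj(torch.randn(8, 2048))   # prints: Linear input shape: (8, 2048)
```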
My PyTorch is 1.7.1, CUDA 11.1, Python 3.7, torchvision 0.8.2.
Thanks for sharing your machine's config. Mine is: torch=1.3.0, CUDA=10.1.243, python=2.7.12, torchvision=0.4.1. Quite far from your config.
As you suggested, I printed the input shape for the projection MLP and it is indeed 2D: batch x feat_dim (2048). It is the same 2D shape in the DP version, so the input looks fine to me. This may be caused by the different PyTorch versions. Thanks anyway!
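A side note on the ValueError, with shapes assumed from this discussion: a plain `BatchNorm1d` accepts 2D (batch, features) input, but older `SyncBatchNorm` implementations required at least 3D input, which would match both the error message and torch 1.3 if the DDP path converts BN layers to syncBN. A minimal sketch:

```python
import torch
import torch.nn as nn

# 2D activations, batch x feat_dim, as printed in the thread.
x = torch.randn(8, 2048)
bn = nn.BatchNorm1d(2048)
out = bn(x)                           # plain BatchNorm1d accepts 2D: (8, 2048)

# Older SyncBatchNorm versions rejected 2D input. One possible workaround is
# a dummy length-1 spatial dimension added and removed around the BN call:
y = bn(x.unsqueeze(-1)).squeeze(-1)   # back to (8, 2048)
```

Upgrading PyTorch (recent versions accept 2D input in `SyncBatchNorm`) is likely the cleaner fix than patching shapes.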
Hi @taoyang1122, I tried to use the SimSiam model you shared on Google Drive, named unsuperivsed_petrained.tar. Is it the last-epoch output? I cannot load it. Is any preprocessing needed?
The error is below:
```
checkpoint = torch.load(args.pretrained, map_location="cpu")
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 426, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 599, in _load
    raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))
RuntimeError: /gpfs-volume/train_logs/simsiam/taoyang_pretrained.tar is a zip archive (did you mean to use torch.jit.load()?)
```
I will try renaming it to unsupervised_pretrained.pth.tar.
After renaming it to .pth.tar on my side, I still cannot load it; same error.
That's weird; I don't have such an issue. I don't know if it is caused by the PyTorch version.
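For what it's worth, the "is a zip archive" RuntimeError is the usual symptom of a checkpoint saved by PyTorch >= 1.6 (which defaults to a zip-based format) being loaded with an older PyTorch such as 1.3. A possible workaround, sketched below with a hypothetical helper, is to re-save the checkpoint in the legacy format on a machine with a recent PyTorch:

```python
import torch

def convert_to_legacy(src_path: str, dst_path: str) -> None:
    # Load the zip-format checkpoint (requires PyTorch >= 1.6), then write it
    # back in the legacy serialization format that PyTorch < 1.6 can read.
    ckpt = torch.load(src_path, map_location="cpu")
    torch.save(ckpt, dst_path, _use_new_zipfile_serialization=False)
```

Usage would be something like `convert_to_legacy("taoyang_pretrained.pth.tar", "taoyang_pretrained_legacy.pth.tar")`, after which the legacy file should load under the older PyTorch.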
Sorry. The problem may be at my side. I can load it on another server. Thanks.
@taoyang1122 With the pre-trained model you provided and the same parameters (below), I reach a very similar linear-evaluation result: 67.778%. :-)
```
  -a resnet50 \
  --lr 1.6 \
  --cos \
  --epochs 90 \
  --batch-size 4096 \
  -p 100 \
  --pretrained ./taoyang_pretrained.pth.tar \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed \
  --world-size 1 \
  --rank 0 \
```
Great! You can try to fix the DDP issue, and then you should be able to reproduce the results from scratch.
How much time did it cost for reproducing this result on ImageNet?
Thanks @taoyang1122 for sharing this great repo.
For the 67.8% linear-classification result on the SimSiam pre-trained features, is it from the last (90th) epoch model or the best epoch model?