(1) Weight decay: We use 5e-4 for the small and medium datasets, and 1e-4 for ImageNet. (2) The ResNet-18 7x7 -> 3x3 conv replacement is only for the small and medium datasets, which are not implemented in this codebase.
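For reference, the stem change is just the usual small-image modification (a minimal sketch with torchvision's resnet18; whether the max-pool is also removed depends on the dataset):

import torch.nn as nn
from torchvision.models import resnet18

encoder = resnet18()
# Replace the 7x7/stride-2 stem conv with a 3x3/stride-1 conv so that small inputs
# (e.g. 32x32 or 64x64) are not downsampled too aggressively in the first layer.
encoder.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)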
The experimental setting in this repository is just for ImageNet and we have provided a training script in script/train.sh. We are sorry about the missing hyperparameters in our paper.
For the small and medium dataset experiments, we provide a separate codebase here; please check it out and feel free to ask if you have any further questions.
BTW, could you share your training setting for the 65% result? Did you simply run this codebase directly?
we provide a separate codebase here
Thank you! I will definitely have a look.
Did you simply run this codebase directly?
No, I am working in a different repository, here. I've just added the 1e-4 weight decay to the code. Also, I've just discovered that I used log(softmax(x)) instead of log_softmax(x) when comparing the two distributions, which might have led to numerically unstable computations. I will re-run in the next few days and get back to you.
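For reference, this is the kind of thing that goes wrong (a minimal sketch with random logits, not code from either repo):

import torch
import torch.nn.functional as F

logits = 50 * torch.randn(4, 8)                      # large logits make the problem visible
unstable = torch.log(torch.softmax(logits, dim=-1))  # softmax can underflow to 0, so the log becomes -inf
stable = F.log_softmax(logits, dim=-1)               # fused, numerically stable equivalent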
Thanks again!
While checking your code, I realized you also shuffled the batch when training ReSSL on one GPU. I've never used this technique before and didn't even think about this case. It's great that you uploaded that zip and that I found this out, because it might have a huge effect on all moving-average methods. I've also just discovered that you used MaxPooling in ResNet-18 and a 2048 hidden_dim when training on Tiny-ImageNet. I had overlooked all of these.
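In case anyone else missed it, my understanding is that the single-GPU batch shuffling amounts to something like this (a rough sketch with my own names; it only changes anything when the encoder uses a split/sub-batch BatchNorm to emulate multi-GPU statistics):

import torch

@torch.no_grad()
def momentum_forward_shuffled(teacher, x):
    # Permute the batch before the momentum/teacher forward pass and undo the
    # permutation afterwards, so its (split) BatchNorm layers do not see the
    # same per-chunk statistics as the student branch.
    idx = torch.randperm(x.size(0), device=x.device)
    out = teacher(x[idx])
    return out[torch.argsort(idx)]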
Thanks again for sharing the code, it helped a lot! :relaxed:
I am very close: 69.04% accuracy. Do you think it's within the noise range? 1% sounds like a lot, but it might be the case. I will do another step-by-step code review; maybe I'll find something.
1% is not acceptable as noise. Let me briefly summarize some key points that I think you should check in the training setting.
For pre-training
lr = 0.05
weight_decay = 1e-4
momentum=0.9
teacher temperature = 0.04
student temperature = 0.1 (both temperatures enter the loss; see the sketch after the weak augmentation)
warm up for 5 epochs and use a cosine scheduler
m = 0.999
hidden_dim for projection head = 4096
out_dim for projection head = 512
no bn layer in the projection head !!!
memory buffer size = 131072
batch size = 256, 32 per GPU (shuffle bn might cause different results if you do not strictly follow this setting)
contrastive augmentation:
transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([GaussianBlur([.1, 2.])], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
weak augmentation:
transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
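Putting the two temperatures and the memory buffer together, the loss is essentially a cross-entropy between the teacher's and the student's relation distributions over the queue. A simplified sketch (variable names are illustrative, not copied from the repo):

import torch
import torch.nn.functional as F

def relation_loss(z_student, z_teacher, queue, tau_s=0.1, tau_t=0.04):
    # z_student: embeddings of the strongly augmented view (gradients flow here)
    # z_teacher: detached embeddings of the weakly augmented view from the momentum encoder
    # queue:     memory buffer of past teacher embeddings, shape (K, dim)
    z_student = F.normalize(z_student, dim=1)
    z_teacher = F.normalize(z_teacher, dim=1)
    logits_s = z_student @ queue.t() / tau_s
    logits_t = z_teacher @ queue.t() / tau_t
    p_teacher = F.softmax(logits_t, dim=1)           # sharpened target relation
    log_p_student = F.log_softmax(logits_s, dim=1)   # log_softmax, not log(softmax)
    return -(p_teacher * log_p_student).sum(dim=1).mean()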
For linear evaluation
change the backbone to evaluation mode !!!!
zero init the linear classifier !!! (see the sketch after the eval augmentation)
batch size = 256
momentum=0.9
learning rate = 0.3
weight_decay = 0
cosine scheduler
training augmentation:
transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
eval augmentation:
transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
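To be explicit about the two points marked with !!!, the linear evaluation setup looks roughly like this (a simplified sketch, not the exact code from the repo):

import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()                  # load the pre-trained weights here
backbone.fc = nn.Identity()            # expose the 2048-d pooled features
backbone.eval()                        # keep BatchNorm layers in eval mode
for p in backbone.parameters():
    p.requires_grad = False            # only the linear classifier is trained

classifier = nn.Linear(2048, 1000)     # ImageNet linear head
nn.init.zeros_(classifier.weight)      # zero-init the classifier
nn.init.zeros_(classifier.bias)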
Thank you!
The only difference on my side is the learning rate. Is that a typo? I've found 0.05 in both the paper and your code. Also, I used Nesterov acceleration at linear evaluation, but that probably won't make much of a difference.
However, I pretrained the network with half-precision floats: the loss calculation happened in float32, but the encoder's forward pass was done in float16. I've never seen this cause a difference in supervised setups, but maybe it results in a ~1% drop in self-supervised training. I have a pretrained SimSiam network but haven't run linear evaluation on it; that could be a sanity check.
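Concretely, the mixed-precision pattern I used is roughly the standard autocast setup (a sketch with placeholder names for the loader, encoder, loss and optimizer):

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for images, _ in loader:                   # placeholder data loader
    optimizer.zero_grad()
    with autocast():                       # encoder forward runs in float16 where safe
        feats = encoder(images.cuda(non_blocking=True))
    loss = compute_loss(feats.float())     # the loss itself is computed in float32
    scaler.scale(loss).backward()          # scaling avoids float16 gradient underflow
    scaler.step(optimizer)
    scaler.update()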
Yes, the learning rate should be 0.05, sorry about the typo. I'm not quite sure about the effect of fp16 on this codebase.
The evaluation of my 100-epoch SimSiam network also looks a bit weaker: 67.56% instead of 68.1%. So it's very likely that fp16 is the reason behind the 1% drop. (Interestingly, the SwAV evaluation protocol performs weakly on the SimSiam network, by the way; it really needs a 4096 batch size with LARS.)
I think you can close this issue. Thank you for all the code and the detailed answers!
Hi!
In my last issue, I forgot to congratulate you on your exceptional paper. I had been looking for a relational method since PAWS, but couldn't really find one that achieves such high performance on ImageNet. Also, this method works with a small batch size and very low computational resources thanks to the frozen target network and single-view backprop. Nice work!
Reading the code, though, I noticed two minor differences from the paper. Could you please double-check these and clarify which setting reflects the published results?
Thank you.
PS: I am trying to reproduce your results from the paper, but I'm currently stuck around 65% on ImageNet.