Dear Author,
Thank you for sharing this amazing piece of work :) 👍
I have a question about the paper where it says:
"We carry out grid-search of batch size ∈ {64, 128, 256, 512} and learning rate ∈ {1e-5, 3e-5, 5e-5} on STS-B development set and adopt the hyper-parameter settings in Table A.1. We find that SimCSE is not sensitive to batch sizes as long as tuning the learning rates accordingly, which contradicts the finding that contrastive learning requires large batch sizes."
The part "as long as tuning the learning rates accordingly." concerns me. For batch size of 512, what learning rate would you recommend (for unsupervised SimCSE)? It seems like it is not that easy to figure it out 😢 In your experiments, what learning rates did you use for different batch sizes to confirm such insensitivity?
We simply tried all combinations of batch size ∈ {64, 128, 256, 512} × learning rate ∈ {1e-5, 3e-5, 5e-5} and picked the best one on the STS-B development set. The combinations we used are listed in Appendix A of our paper.
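For what it's worth, the exhaustive grid search described above can be sketched roughly like this. The `train_and_eval` argument is a hypothetical stand-in for training unsupervised SimCSE with a given (batch size, learning rate) pair and returning the Spearman correlation on the STS-B dev set; the toy `fake_eval` below is only there to make the sketch runnable:

```python
from itertools import product

def grid_search(train_and_eval, batch_sizes, learning_rates):
    """Try every (batch_size, lr) combination and keep the best dev score."""
    best = None  # (batch_size, lr, score)
    for bs, lr in product(batch_sizes, learning_rates):
        score = train_and_eval(bs, lr)  # e.g. STS-B dev Spearman correlation
        if best is None or score > best[2]:
            best = (bs, lr, score)
    return best

# Toy stand-in for "train SimCSE and evaluate on STS-B dev" (hypothetical):
# it just peaks at bs=64, lr=3e-5 so the sketch has a deterministic winner.
def fake_eval(bs, lr):
    return -abs(bs - 64) * 1e-4 - abs(lr - 3e-5) * 100

best_bs, best_lr, best_score = grid_search(
    fake_eval, [64, 128, 256, 512], [1e-5, 3e-5, 5e-5]
)
print(best_bs, best_lr)  # → 64 3e-05
```

In the real setting each `train_and_eval` call is a full fine-tuning run, so the 4 × 3 grid means 12 training runs per model.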