princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Why is the max_sequence_length just 32? #112

Closed leoozy closed 2 years ago

leoozy commented 2 years ago

Hello, I noticed that the max_sequence_length in your code is set to 32, but most sentences in English Wikipedia have more than 32 tokens. Why is the max_sequence_length 32? Thank you.

gaotianyu1350 commented 2 years ago

Hi,

This is because most sentences in the datasets we used have sequence lengths below 32. For efficiency, we set the max length to 32.
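For context, one way to sanity-check this on your own corpus is to tokenize it and look at the length distribution. A minimal sketch, assuming the repo's unsupervised training file `wiki1m_for_simcse.txt` is in the working directory (adjust the path for your data):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Count how many sentences fit within 32 tokens (including [CLS] and [SEP]).
with open("wiki1m_for_simcse.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

lengths = [len(tokenizer(s)["input_ids"]) for s in sentences]
short = sum(1 for n in lengths if n <= 32)
print(f"{short / len(lengths):.1%} of sentences fit in 32 tokens")
```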

TrieuLe0801 commented 2 years ago

Hello, actually we can modify the max_seq_len. I managed to configure it with Transformers, but I got a CUDA out-of-memory error with sentence_transformers when I tried to increase the max_seq_len.
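For reference, with sentence_transformers the limit is exposed as the model's `max_seq_length` attribute. A sketch, using one of the repo's published checkpoints (raising the limit is exactly what increases memory use and triggers the OOM below):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("princeton-nlp/sup-simcse-bert-base-uncased")
print(model.max_seq_length)  # the model's current truncation limit

model.max_seq_length = 256   # raise it; activation memory grows with it
embeddings = model.encode(["a long input sentence ..."], batch_size=8)
```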

gaotianyu1350 commented 2 years ago

If you encounter a CUDA out-of-memory error, you should decrease the sequence length or the batch size.
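If you want to keep the effective batch size while cutting per-step memory, gradient accumulation is the usual workaround. A hedged sketch with Hugging Face `TrainingArguments` (the output path and values are illustrative):

```python
from transformers import TrainingArguments

# 8 examples per step, accumulated over 8 steps, gives an effective
# batch of 64 at roughly 1/8 of the peak activation memory.
args = TrainingArguments(
    output_dir="result/my-simcse",   # hypothetical output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    fp16=True,                        # mixed precision cuts memory further
)
```

One caveat: SimCSE's contrastive loss uses in-batch negatives, so accumulation shrinks the number of negatives seen per forward pass and is not fully equivalent to a larger real batch.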

voronoiwanxyz commented 2 years ago

> Hi,
>
> This is because most sentences in the datasets we used have sequence lengths below 32. For efficiency, we set the max length to 32.

For the unsupervised version, is the max_seq_len at inference still confined to 32, as in training? If not, is there a truncation strategy for the extremely long queries that appear in the wiki training dataset?

gaotianyu1350 commented 2 years ago

If the sentence is longer than the set max length, it will be truncated.
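Concretely, this is the standard Transformers tokenizer behavior, a minimal sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokens beyond max_length are dropped; the example is kept, not skipped.
enc = tokenizer("a very long sentence " * 50,
                truncation=True, max_length=32, padding="max_length")
print(len(enc["input_ids"]))  # 32
```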

TrieuLe0801 commented 2 years ago

> If you encounter a CUDA out-of-memory error, you should decrease the sequence length or the batch size.

Yes, you are right. But in my case a maximum input length of 256 is a hard requirement, so I cannot decrease it. As for the batch size, I tried one sentence pair per batch, but it still ran out of memory.
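If the batch size is already 1 and the length of 256 is fixed, the remaining knobs are precision and activation memory. A hedged sketch with plain Transformers (not sentence_transformers), assuming a BERT-base backbone:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").cuda()
model.gradient_checkpointing_enable()  # recompute activations in the backward pass

batch = tokenizer(["sentence a", "sentence b"], truncation=True,
                  max_length=256, padding=True, return_tensors="pt").to("cuda")

# Mixed precision roughly halves activation memory in the forward pass.
with torch.autocast("cuda", dtype=torch.float16):
    out = model(**batch)
```

Gradient checkpointing only pays off during training, where backward-pass activations dominate memory; autocast helps in both training and inference.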

vincentwu0730 commented 2 years ago

> If you encounter a CUDA out-of-memory error, you should decrease the sequence length or the batch size.

> Yes, you are right. But in my case a maximum input length of 256 is a hard requirement, so I cannot decrease it. As for the batch size, I tried one sentence pair per batch, but it still ran out of memory.

Pick a smaller model.

gaotianyu1350 commented 2 years ago

@TrieuLe0801 Yeah, you should consider using a smaller model or a larger GPU.
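For completeness, a sketch of the smaller-model route: swap the BERT-base backbone for a distilled encoder. Note there is no official distilled SimCSE checkpoint, so this assumes you retrain with the repo's scripts:

```python
from transformers import AutoModel, AutoTokenizer

# DistilBERT keeps 6 of BERT's 12 layers, roughly halving memory and compute.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
```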