naver / splade

SPLADE: sparse neural search (SIGIR21, SIGIR22)

Chunk token limit for SPLADE sparse embeddings? #39

Closed arainey2022 closed 1 year ago

arainey2022 commented 1 year ago

Hi there, I can't seem to find this documented, but is there a maximum or optimal text chunk size when creating sparse embeddings?

Thank you!

thibault-formal commented 1 year ago

hi @arainey2022

The largest text chunk you can ingest is determined by the language model's max input length (e.g., 512 subwords for a SPLADE model based on BERT).

There is not really an "optimal" size; most of our models have been trained with a max input length of 256, but extending the window to 512 at test time usually works fine.
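For illustration, here is a minimal sketch of how you could check that a chunk fits the model's input window using the Hugging Face `transformers` tokenizer. The checkpoint name is just one published SPLADE model and the variable names are illustrative assumptions, not something prescribed by this repo; substitute the checkpoint you actually use.

```python
# Minimal sketch (not from this repo): count subword tokens with the tokenizer
# of a BERT-based SPLADE checkpoint to see whether a chunk fits the 512 window.
from transformers import AutoTokenizer

# Illustrative checkpoint name; use the SPLADE model you actually encode with.
tokenizer = AutoTokenizer.from_pretrained("naver/splade-cocondenser-ensembledistil")

text = "Some passage you want to encode with SPLADE."

# Length in subword tokens, including the [CLS]/[SEP] special tokens BERT adds.
n_subwords = len(tokenizer(text)["input_ids"])
print(f"{n_subwords} subword tokens")

# Tokenizing with truncation guarantees the input never exceeds the model limit.
encoded = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
```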

arainey2022 commented 1 year ago

This is really helpful, thank you, @thibault-formal!

And just so I'm not being silly: the sub-word limit is effectively the token-length limit? So we can split chunks to a max token length of 512.

Also, when you say "at test time", what do you mean, please? As in not suitable for production?

Thanks, again!

thibault-formal commented 1 year ago

And just so I'm not being silly: the sub-word limit is effectively the token-length limit? So we can split chunks to a max token length of 512.

Yes, that's right; see the sketch at the end of this comment for one way such a split could look.

Also, when you say "at test time", what do you mean, please? As in not suitable for production?

"Test time" here just means after training, i.e., at inference time; it does not imply anything about production use.
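And here is the splitting sketch mentioned above. It is only a hedged example: the checkpoint name, the `split_into_chunks` helper, and the hard 512 budget are illustrative assumptions rather than anything prescribed by this repo, and boundaries chosen this way can fall in the middle of a word.

```python
# Minimal sketch (not from this repo): split a long document into chunks that
# each fit the 512-subword input window, leaving room for [CLS]/[SEP].
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("naver/splade-cocondenser-ensembledistil")

def split_into_chunks(text: str, max_tokens: int = 512) -> list[str]:
    # Tokenize once without special tokens, then decode fixed-size windows back to text.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    budget = max_tokens - tokenizer.num_special_tokens_to_add()  # room for [CLS]/[SEP]
    return [tokenizer.decode(ids[i : i + budget]) for i in range(0, len(ids), budget)]

long_document = "..."  # your long passage here
chunks = split_into_chunks(long_document)
# Each chunk now re-tokenizes to roughly the budget at most (counts near chunk
# boundaries may shift by a token or two after decoding and re-tokenizing).
```

Splitting on token ids rather than on characters or words keeps the count aligned with the model's own tokenizer, which is what the 512-subword limit is expressed in.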

arainey2022 commented 1 year ago

Thank you very much!