Hi there, I can't seem to find this documented, but is there a maximum or optimal text chunk size when creating sparse embeddings?
Thank you!
hi @arainey2022
The largest text chunk you can ingest is determined by the language model's maximum input length (e.g., 512 subword tokens for a SPLADE model based on BERT).
There is not really an "optimal" size; most of our models have been trained with a max input length of 256, but extending the window to 512 at test time usually works fine.
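In case it helps, here is a minimal sketch of how you could check the limit and split a long document into chunks of at most 512 subword tokens. It assumes a Hugging Face tokenizer and uses the `naver/splade-cocondenser-ensembledistil` checkpoint purely as an example; substitute whatever model you actually load.

```python
from transformers import AutoTokenizer

# Example checkpoint; substitute whichever SPLADE model you actually use.
tokenizer = AutoTokenizer.from_pretrained("naver/splade-cocondenser-ensembledistil")

# The tokenizer reports the underlying model's maximum input length
# (512 subword tokens for BERT-based models).
print(tokenizer.model_max_length)

def split_into_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Split a long document into chunks of at most `max_tokens` subword tokens."""
    # Tokenize without special tokens so the ids can be sliced freely,
    # then decode each slice back to text before feeding it to the model.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    # Leave room for the [CLS]/[SEP] tokens the model adds to each chunk.
    window = max_tokens - tokenizer.num_special_tokens_to_add()
    return [tokenizer.decode(ids[i:i + window]) for i in range(0, len(ids), window)]
```

The splitting here is purely at token boundaries; in practice you may prefer splitting on sentence or passage boundaries so each chunk stays self-contained.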
This is really helpful, thank you, @thibault-formal!
Just so I'm not being silly: the sub-word limit is effectively the same as the token length limit? So we can split chunks to a max token length of 512?
Also, when you say "at test time", what do you mean? As in, not suitable for production?
Thanks, again!
> Just so I'm not being silly: the sub-word limit is effectively the same as the token length limit? So we can split chunks to a max token length of 512?
Yes, that's right.
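To make the distinction concrete, here is a tiny illustration (again assuming a Hugging Face tokenizer; the example sentence is arbitrary): the 512 limit counts subword tokens, which is usually more than the whitespace word count, so chunk by tokens rather than by words.

```python
from transformers import AutoTokenizer

# Example checkpoint; use whichever SPLADE model you actually load.
tokenizer = AutoTokenizer.from_pretrained("naver/splade-cocondenser-ensembledistil")

text = "Sparse lexical expansion models tokenize words into subword pieces."
print(len(text.split()))                                            # whitespace words
print(len(tokenizer(text, add_special_tokens=False)["input_ids"]))  # subword tokens (>= word count)
```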
> Also, when you say "at test time", what do you mean? As in, not suitable for production?
"Test time" here just means after training, i.e., at inference!
Thank you very much!