Open hyleemindslab opened 2 years ago
I used random non-overlapping sequences. Technically, what is desired here is a good passage tokenizer; you may get better performance if you can do a better job of separating out the passages.
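As a rough illustration of "random non-overlapping sequences", the sketch below chops a tokenized document into consecutive fixed-size chunks (hence non-overlapping) and shuffles their order. The function and parameter names are hypothetical, not from the repo, and a real passage tokenizer would split on passage boundaries instead of fixed offsets.

```python
import random

def split_into_spans(tokens, span_len):
    """Split a tokenized document into non-overlapping spans in random order.

    Minimal sketch (assumed names, not the repo's code): consecutive
    span_len-sized chunks never overlap; shuffling gives a random sampling
    order over them.
    """
    spans = [tokens[i:i + span_len] for i in range(0, len(tokens), span_len)]
    random.shuffle(spans)
    return spans
```

Because the chunks partition the token sequence, every token appears in exactly one span.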
@luyug I'm wondering how long these spans are. From what I understand, you were using $MAX_LENGTH in the scripts to set the length. Can you share the values you used when training the models?
The length should be chosen to align roughly with the text lengths in your actual search task (rounded according to your accelerator's requirements). For passage retrieval, we used 128.
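The rounding mentioned here could look like the following: pick a target length from your task's texts, then round it up to a hardware-friendly multiple (e.g. a multiple of 8 for fp16 tensor cores). The helper name and the default multiple are illustrative assumptions, not values from the scripts.

```python
def round_to_accelerator(length, multiple=8):
    """Round a target span length up to an accelerator-friendly multiple.

    Illustrative sketch: ceil-divide then multiply back. The multiple of 8
    is a common fp16 tensor-core heuristic, assumed here, not prescribed
    by the repo.
    """
    return -(-length // multiple) * multiple
```

For example, a typical passage length of 125 tokens rounds up to 128, the value used for passage retrieval above.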
Hello,
In the paper it is stated that
I was wondering how the spans were extracted from a document. Are they sentences, each split out by nltk.sentence_tokenizer? Or are they equally sized chunks extracted with a sliding window? Or are they the same as the Condenser pretraining blocks, annotated with the id of the document they belong to?
Thank you.