luyug / Condenser

EMNLP 2021 - Pre-training architectures for dense retrieval
Apache License 2.0

Regarding the spans in the contrastive loss calculation #5

Open hyleemindslab opened 2 years ago

hyleemindslab commented 2 years ago

Hello,

In the paper it is stated that

... given a random list of $n$ documents $[d_1, d_2, \dots, d_n]$, we extract randomly from each a pair of spans, $[s_{11}, s_{12}, \dots, s_{n1}, s_{n2}]$.

I was wondering how the spans were extracted from a document. Are they sentences, each of which is split by nltk.sentence_tokenizer? Or, are they equally sized chunks extracted using a sliding window? Maybe they are the same as the Condenser pretraining blocks but annotated with a document id to which they belong?

Thank you.

luyug commented 2 years ago

I used random, non-overlapping sequences. Technically, what is desired here is a good passage tokenizer; you may get better performance if you can do a better job of separating out the passages.
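
For anyone reading along, here is a minimal sketch of what sampling a pair of random non-overlapping spans from one tokenized document could look like. The helper name `sample_span_pair`, the default span length, and the fallback for short documents are assumptions for illustration, not the released training code:

```python
import random

def sample_span_pair(tokens, span_len=128, rng=random):
    """Sample two non-overlapping spans of span_len tokens from one document."""
    if len(tokens) < 2 * span_len:
        # Document too short for two disjoint full-length spans: split it in half.
        mid = len(tokens) // 2
        return tokens[:mid], tokens[mid:]
    # Place the first span anywhere that leaves room for a second span after it,
    # then place the second span strictly after the first.
    first = rng.randrange(0, len(tokens) - 2 * span_len + 1)
    second = rng.randrange(first + span_len, len(tokens) - span_len + 1)
    spans = [tokens[first:first + span_len], tokens[second:second + span_len]]
    rng.shuffle(spans)  # avoid a systematic "earlier span first" bias
    return spans[0], spans[1]
```

Building the pre-training batch would then amount to calling this once per document and flattening the resulting pairs, so that the two spans from a document act as positives for each other and all other spans in the batch act as negatives.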

eugene-yang commented 2 years ago

@luyug I'm wondering how long these spans are. From what I understand, you were using $MAX_LENGTH in the scripts to set the length. Can you share the values you used when training the models?

luyug commented 2 years ago

The length should be selected to align roughly with the text lengths in your actual search task (rounded according to your accelerator's requirements). For passage retrieval, we used 128.
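
In case the rounding remark is unclear: one common interpretation (an assumption on my part, not something stated in the repo) is to round the chosen length up to a multiple your hardware handles efficiently, e.g. a multiple of 8 for fp16 tensor cores:

```python
import math

def round_length(target_len, multiple=8):
    # Round a desired span length up to a hardware-friendly multiple.
    # The multiple-of-8 default is an assumption (typical for fp16 tensor cores);
    # check your accelerator's documentation for its actual requirement.
    return math.ceil(target_len / multiple) * multiple

print(round_length(122))  # -> 128, the length used above for passage retrieval
```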