luyug / COIL

NAACL2021 - COIL Contextualized Lexical Retriever
Apache License 2.0

How is document expansion helpful if p_max_len=192 in unicoil training and encoding command? Most MSMARCO passages are over 192 tokens #16

Open nirmal2k opened 2 years ago

nirmal2k commented 2 years ago

How is corpus-d2q prepared? With what p_max_len was castorini/unicoil-d2q-msmarco-passage trained? Can I set p_max_len to 512 and encode with it?

MXueguang commented 2 years ago

Hi @nirmal2k, yes, you can set p_max_len to 512 and encode with it. castorini/unicoil-d2q-msmarco-passage was trained with p_max_len 192.

corpus-d2q contains the original msmarco-passage text tokens + [SEP] + the new tokens generated by doc2query.
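
As a rough illustration of that layout and of why p_max_len matters, here is a minimal sketch. It is not the actual preparation script: `build_expanded_text` and `truncate_tokens` are hypothetical helpers, whitespace splitting stands in for the real BERT wordpiece tokenizer, and the thread does not say whether duplicate tokens are filtered, so the sketch simply appends the predicted queries as they are.

```python
from typing import List


def build_expanded_text(passage: str, predicted_queries: List[str], sep: str = "[SEP]") -> str:
    """Hypothetical helper: passage text + [SEP] + doc2query tokens."""
    expansion = " ".join(predicted_queries)
    return f"{passage} {sep} {expansion}"


def truncate_tokens(text: str, p_max_len: int) -> List[str]:
    """Keep only the first p_max_len tokens.

    Assumption: whitespace tokens approximate the model's wordpiece tokens.
    """
    return text.split()[:p_max_len]


passage = "the cat sat on the mat"
queries = ["where did the cat sit"]
expanded = build_expanded_text(passage, queries)

# With a generous limit, the expansion tokens survive truncation.
kept_long = truncate_tokens(expanded, 192)

# With a limit shorter than the passage, the expansion is cut off entirely.
kept_short = truncate_tokens(expanded, 4)

print(kept_long)
print(kept_short)
```

Because the expansion sits after the passage, any passage that already fills p_max_len loses its doc2query tokens at truncation, which is the case a larger p_max_len such as 512 would help with.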