stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License
2.67k stars 355 forks

fix: fix masking to consider mask_token #314

Open JoanFM opened 4 months ago

JoanFM commented 4 months ago

Hey team!

Thanks for the great project! I have found an issue when encoding documents in different batches and then scoring them independently against different queries.

The mask method used inside the doc method of ColBERT does not consider mask tokens. It could either check for mask tokens or take into account the attention_mask passed to the doc method.

The masking method in ColBERT only handles skiplist tokens and pad tokens; it does not handle mask tokens. As a result, a document's relevance score differs depending on whether it was encoded on its own or in a batch together with other documents.
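
To make the problem concrete, here is a minimal sketch in plain Python. It is not the repository's code: `build_doc_mask`, the token ids, and the assumption that batch padding is filled with mask-token ids are all hypothetical and only illustrate how honoring attention_mask would keep the number of scored tokens the same regardless of batching.

```python
from typing import List, Optional, Set

# Assumed token ids, for illustration only; a real tokenizer may differ.
PAD_ID = 0
MASK_ID = 103


def build_doc_mask(
    input_ids: List[List[int]],
    skiplist: Set[int],
    pad_token_id: int = PAD_ID,
    attention_mask: Optional[List[List[int]]] = None,
) -> List[List[bool]]:
    """Hypothetical helper: True marks a token that takes part in scoring."""
    mask = []
    for row_idx, row in enumerate(input_ids):
        row_mask = []
        for col_idx, token_id in enumerate(row):
            # Behavior described in this issue: drop skiplist and pad tokens only.
            keep = token_id not in skiplist and token_id != pad_token_id
            # Proposed addition: also drop positions the attention mask excludes.
            if attention_mask is not None:
                keep = keep and attention_mask[row_idx][col_idx] == 1
            row_mask.append(keep)
        mask.append(row_mask)
    return mask


# The same document encoded alone, and padded to a longer batch with mask ids.
doc_alone = [[101, 7592, 102]]
doc_batched = [[101, 7592, 102, MASK_ID, MASK_ID]]
attn_batched = [[1, 1, 1, 0, 0]]

kept_alone = sum(build_doc_mask(doc_alone, set())[0])
kept_ignoring_attention = sum(build_doc_mask(doc_batched, set())[0])
kept_with_attention = sum(
    build_doc_mask(doc_batched, set(), attention_mask=attn_batched)[0]
)

print(kept_alone, kept_ignoring_attention, kept_with_attention)
```

Without the attention mask, the two padding positions survive as if they were real document tokens, so the batched encoding is scored over more token embeddings than the standalone one. With the attention mask applied, both encodings keep the same three tokens.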

bwanglzu commented 4 months ago

@bclavie @okhat could you please take a look?