Thanks for the great project! I found an issue when encoding documents in different batches and then scoring them independently against different queries.
The `mask` method used inside the `doc` method of ColBERT does not consider mask tokens. It could either handle mask tokens itself or take into account the `attention_mask` passed to the `doc` method.
The masking method in ColBERT only handles `skip` and `pad_tokens`, not `mask_tokens`. As a result, a document's relevance score differs depending on whether it was encoded alone or in a batch together with other documents.
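
To make the report concrete, here is a minimal pure-Python sketch of the behaviour I think I am seeing. It is not the library's code: the token ids, the helper names (`mask_skip_and_pad`, `mask_with_attention`, `maxsim`) and the toy embeddings are all made up for illustration, and it assumes that positions added by batch padding carry a mask-token id rather than the pad id.

```python
# Hypothetical token ids, chosen only for this illustration.
PAD_ID = 0
MASK_ID = 103
SKIP_IDS = {1012}  # stand-in for a skiplist entry such as punctuation


def mask_skip_and_pad(input_ids):
    """Keep a token unless it is on the skiplist or is the pad token."""
    return [tok not in SKIP_IDS and tok != PAD_ID for tok in input_ids]


def mask_with_attention(input_ids, attention_mask):
    """Same as above, but also drop positions the attention mask marks as padding."""
    keep = mask_skip_and_pad(input_ids)
    return [k and bool(a) for k, a in zip(keep, attention_mask)]


def maxsim(query_vecs, doc_vecs, keep):
    """Sum over query tokens of the best dot product with a kept document token."""
    kept = [d for d, k in zip(doc_vecs, keep) if k]
    return sum(
        max(sum(qi * di for qi, di in zip(q, d)) for d in kept)
        for q in query_vecs
    )


query = [[0.6, 0.8]]

# Document encoded on its own: two real tokens, no padding.
ids_alone = [7, 8]
vecs_alone = [[1.0, 0.0], [0.0, 1.0]]

# Same document in a batch with a longer one: two extra positions
# (assumed to hold MASK_ID) that still receive some embedding.
ids_batched = [7, 8, MASK_ID, MASK_ID]
attention_batched = [1, 1, 0, 0]
vecs_batched = vecs_alone + [[0.6, 0.8], [0.6, 0.8]]

score_alone = maxsim(query, vecs_alone, mask_skip_and_pad(ids_alone))
score_batched_leaky = maxsim(query, vecs_batched, mask_skip_and_pad(ids_batched))
score_batched_fixed = maxsim(
    query, vecs_batched, mask_with_attention(ids_batched, attention_batched)
)

print(score_alone, score_batched_leaky, score_batched_fixed)
```

In this toy setup the skip/pad-only mask lets the padded positions take part in the max, so the batched score is higher than the score for the same document encoded alone. Combining the mask with `attention_mask` brings the two back in line.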