stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Tokenization Assumption for Query Marker Replacement is Inconsistent #346

Open sleep-ee opened 1 month ago

sleep-ee commented 1 month ago

In QueryTokenizer.tensorize() and DocTokenizer.tensorize(), prepending '. ' to each string assumes that the period character will always be tokenized into exactly one ID. This does not hold for every tokenizer: some split the period into multiple IDs, so overwriting a single position leaves superfluous IDs behind. Specifically (in QueryTokenizer.tensorize()):

batch_text = ['. ' + x for x in batch_text]

and then

ids[:, 1] = self.Q_marker_token_id
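To make the failure mode concrete, here is a minimal sketch (not ColBERT's actual code) using two hypothetical toy tokenizers: one where '.' maps to a single ID, and one where it maps to two. The `Q_MARKER` value and both tokenizers are made up for illustration; only the overwrite-position-1 logic mirrors the snippet above.

```python
# Sketch: why overwriting position 1 only works if '. ' tokenizes
# to exactly one ID after the [CLS]-like prefix token.
Q_MARKER = 100  # hypothetical query-marker token ID

def tensorize_ids(tokenize, text):
    # Mimic the pattern above: prepend '. ', tokenize, then overwrite
    # position 1 (the assumed single period ID) with the marker.
    ids = [0] + tokenize('. ' + text)  # 0 stands in for [CLS]
    ids[1] = Q_MARKER
    return ids

# Toy tokenizer A: '.' -> one ID (the assumption holds).
one_id = lambda s: [9 if tok == '.' else len(tok) for tok in s.split()]

# Toy tokenizer B: '.' -> two IDs (assumption broken).
two_ids = lambda s: sum(
    ([9, 9] if tok == '.' else [len(tok)] for tok in s.split()), [])

print(tensorize_ids(one_id, 'hello world'))
# -> [0, 100, 5, 5]: the marker cleanly replaces the period ID.

print(tensorize_ids(two_ids, 'hello world'))
# -> [0, 100, 9, 5, 5]: a stray period ID survives at position 2,
# shifting all subsequent tokens and polluting the sequence.
```

With tokenizer B, the marker is written correctly but the leftover period ID remains in the sequence, which is exactly the inconsistency this issue describes.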