Update to 0.2.1: support different behaviours in query and document tokenization

jingtaozhan / RepCONC

WSDM'22 Best Paper: Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval

MIT License

115 stars 13 forks

Update to 0.2.1: support different behaviours in query and document tokenization #7

Closed jingtaozhan closed 2 years ago

jingtaozhan commented 2 years ago

Some dense retrieval models, such as TCT-ColBERT, tokenize queries and documents differently. To support these models, repconc checks whether tokenizer.__call__ accepts an argument named 'input_text_type'. If it does, repconc sets that argument to 'query' or 'doc', so the tokenizer knows whether the input is a query or a document and can customize its behavior accordingly.
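For anyone writing a custom tokenizer, here is a minimal sketch of how this kind of detection could work. It is not the actual repconc code: the helper names (accepts_input_text_type, tokenize), both toy tokenizer classes, and the "[Q]"/"[D]" markers are hypothetical and only illustrate the idea of inspecting the signature of __call__ and passing 'query' or 'doc' when the argument exists.

```python
import inspect


def accepts_input_text_type(tokenizer) -> bool:
    """Return True if tokenizer.__call__ declares a parameter named 'input_text_type'."""
    params = inspect.signature(tokenizer.__call__).parameters
    return "input_text_type" in params


def tokenize(tokenizer, texts, text_type):
    """Hypothetical helper: forward the text type only to tokenizers that accept it."""
    if text_type not in ("query", "doc"):
        raise ValueError(f"text_type must be 'query' or 'doc', got {text_type!r}")
    if accepts_input_text_type(tokenizer):
        return tokenizer(texts, input_text_type=text_type)
    # Tokenizers without the argument are called unchanged.
    return tokenizer(texts)


class PlainTokenizer:
    """Toy tokenizer that treats queries and documents identically."""

    def __call__(self, texts):
        return [t.lower().split() for t in texts]


class TypedTokenizer:
    """Toy tokenizer that prepends an illustrative marker depending on the input type."""

    def __call__(self, texts, input_text_type="doc"):
        marker = "[Q]" if input_text_type == "query" else "[D]"
        return [[marker] + t.lower().split() for t in texts]


if __name__ == "__main__":
    print(tokenize(TypedTokenizer(), ["Dense retrieval"], "query"))
    print(tokenize(PlainTokenizer(), ["Dense retrieval"], "query"))
```

Note that this sketch only matches a parameter declared by name. A tokenizer whose __call__ takes just **kwargs would not be detected, so under this approach a custom tokenizer should list input_text_type explicitly in its signature.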