lightonai / pylate

Late Interaction Models Training & Retrieval
https://lightonai.github.io/pylate/
MIT License
158 stars, 7 forks

Fix tokenization for query/doc marker #11

Closed by NohTow 2 months ago

NohTow commented 4 months ago

As noted in the code, the current way of adding the query/document marker is not robust at all. This PR from the official repository proposes something a bit more robust; I'll make sure it works and then add it to the project.

NohTow commented 4 months ago

I tried the code, but as I feared (I had already tried this approach while building the project), the results are not OK because the marker tokens need to be tokenized in isolation.
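To illustrate what "tokenized in isolation" means here, this is a toy sketch and not the actual PyLate or ColBERT code. The vocabulary, `encode`, and `encode_with_marker` are all hypothetical, and the toy pre-tokenization (lowercase, split on punctuation) is only an assumption standing in for whatever a real tokenizer does. It shows a marker being broken into pieces when it goes through the text pipeline, versus being resolved on its own and spliced in after `[CLS]`:

```python
import re

# Hypothetical toy vocabulary: "[Q]" and "[D]" exist as single tokens.
VOCAB = {
    "[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "[Q]": 3, "[D]": 4,
    "[": 5, "]": 6, "q": 7, "d": 8, "what": 9, "is": 10, "colbert": 11,
}


def encode(text):
    """Toy encoder: lowercase, split words and punctuation, add [CLS]/[SEP]."""
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    ids = [VOCAB.get(piece, VOCAB["[UNK]"]) for piece in pieces]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


def encode_with_marker(text, marker):
    """Resolve the marker on its own, then splice its id in after [CLS]."""
    marker_id = VOCAB[marker]  # isolated lookup, bypasses the text pipeline
    ids = encode(text)
    return ids[:1] + [marker_id] + ids[1:]


# Marker prepended to the text: it gets lowercased and split into "[", "q", "]".
naive = encode("[Q] what is colbert")
# Marker resolved in isolation: it stays a single token right after [CLS].
robust = encode_with_marker("what is colbert", "[Q]")

print(naive)
print(robust)
```

In the naive call the marker occupies three positions and never maps to the `[Q]` id, while in the second call it occupies exactly one position with the intended id. How a real tokenizer splits a given marker string depends on its normalization and vocabulary, so the same check would have to be repeated for each tokenizer/marker pair.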

We still need to find a more robust solution than what we have, because the current approach will almost certainly break for other tokenizers or markers.