lightonai / pylate

Late Interaction Models Training & Retrieval
https://lightonai.github.io/pylate/
MIT License

Allow to set the prefixes for stanford-nlp models #55

Closed NohTow closed 1 month ago

NohTow commented 1 month ago

After discussions with @bwanglzu, I realized the outputs of jina-colbert-v2 were not identical to the ones obtained using stanford-nlp.

The problem was twofold:

  1. As it is a stanford-nlp repository, the prefixes used were [unused0] and [unused1], whereas the model actually uses [QueryMarker] and [DocumentMarker]. Since this parameter is not directly readable from the repositories, my proposed solution is to let the user define the prefixes when loading the model. This PR adds the ability to set the prefixes for stanford-nlp models and only defaults to the unused tokens if they are not set. It still defaults to [Q] and [D] if the prefixes are not set and the repository is not a stanford-nlp one.
  2. They actually attend to expansion tokens when encoding queries. As this functionality is already available in PyLate, the user just has to set attend_to_expansion_tokens to True. I do not have a way to read this from the repository either, but these parameters are stored in the PyLate configuration when saving the model.
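As an illustration of point 1 (a simplified sketch, not PyLate's actual tokenization code — real prefixing happens at the token level), the effect of the prefixes is to prepend a marker to each input so the model can distinguish queries from documents:

```python
# Illustration only: the marker tokens jina-colbert-v2 expects,
# versus the stanford-nlp defaults ([unused0]/[unused1]) and the
# PyLate defaults ([Q]/[D]).
query_prefix, document_prefix = "[QueryMarker]", "[DocumentMarker]"

def mark(text: str, prefix: str) -> str:
    # Prepend the marker to the raw input (simplified view).
    return f"{prefix} {text}"

print(mark("what is late interaction?", query_prefix))
print(mark("ColBERT computes token-level similarities.", document_prefix))
```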

Thus, loading jina-colbert-v2 looks like this:

from pylate import models

model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
)
bwanglzu commented 1 month ago


Now that I'm on the branch, I still get slightly different results (the embeddings are close); might be related to precision.

bwanglzu commented 1 month ago

The mixed-precision manager introduced the minor diff; disabling it gives identical results. LGTM!

NohTow commented 1 month ago

The outputs are equivalent to RAGatouille's encode_index_free_queries and encode_index_free_documents, so I think we are good. You can also use model_kwargs={"torch_dtype": torch.float16} to handle mixed precision in PyLate.
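Putting the two together, a sketch of loading the model in half precision via model_kwargs (a config fragment assuming model_kwargs is forwarded to the underlying transformers loader, as in sentence-transformers; not run here since it downloads the checkpoint):

```python
import torch
from pylate import models

# Sketch: pass torch_dtype through model_kwargs so the weights are loaded
# in float16, matching the mixed-precision behaviour discussed above.
model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.float16},
)
```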

bwanglzu commented 1 month ago

Quick follow-up on my end (just to keep transparency), PR with sample usage in the jina-colbert repo: https://huggingface.co/jinaai/jina-colbert-v2/discussions/8

raphaelsty commented 1 month ago

@NohTow Would it be possible to update the models documentation and add a note on how to load the Jina model?

Otherwise everything looks good to me; great to see we support the Jina model.

model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
)
NohTow commented 1 month ago

I added a tip saying that we handle stanford-nlp models and added documentation for the Jina model (and added it to the BEIR tab as well).