Mismatch in documents embeddings size

raphaelsty / neural-cherche

Neural Search

https://raphaelsty.github.io/neural-cherche/

MIT License


Mismatch in documents embeddings size #4

Closed iknoorjobs closed 11 months ago

iknoorjobs commented 12 months ago

Hi,

Thanks for the great work on the repo.

While encoding documents from my custom dataset, I ran into a mismatch between the number of input documents and the number of embeddings produced. For example:

ranker_documents_embeddings = ranker.encode_documents(
    documents=documents, # total: 30452
    batch_size=batch_size,
)
print(len(ranker_documents_embeddings), len(documents))  # prints 30427 30452

Do you have any insight into what might be causing this issue?

Thanks

raphaelsty commented 12 months ago

Hi @iknoorjobs, you might have duplicate ids in your set of documents. Since the output of encode_documents is a dict mapping document_id to embedding, documents that share an id collapse into a single entry, so duplicates are dropped.

You should avoid duplicate ids in your documents and in your queries.
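
A quick way to check for this before encoding, as a minimal sketch. It assumes each document is a dict carrying its identifier under an "id" field; that key name, the toy documents, and the helpers find_duplicate_ids and drop_duplicate_ids are made up for illustration and are not part of neural-cherche, so adjust them to your own data.

```python
from collections import Counter


def find_duplicate_ids(documents, key="id"):
    """Return a dict mapping each id that occurs more than once to its count."""
    counts = Counter(document[key] for document in documents)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


def drop_duplicate_ids(documents, key="id"):
    """Keep the first document seen for each id, preserving input order."""
    seen = set()
    unique = []
    for document in documents:
        if document[key] not in seen:
            seen.add(document[key])
            unique.append(document)
    return unique


# Toy data (hypothetical): the id 1 is used twice.
documents = [
    {"id": 0, "text": "Paris is the capital of France."},
    {"id": 1, "text": "Berlin is the capital of Germany."},
    {"id": 1, "text": "A second document reusing id 1."},
]

# Any dict keyed by id keeps one entry per distinct id,
# which is why the embeddings count can be smaller than len(documents).
keyed_by_id = {document["id"]: document for document in documents}

duplicates = find_duplicate_ids(documents)
unique_documents = drop_duplicate_ids(documents)

print(len(documents), len(keyed_by_id))  # the two lengths differ when ids repeat
print(duplicates)                        # ids that appear more than once
print(len(unique_documents))             # documents left after deduplication
```

If find_duplicate_ids returns an empty dict, duplicate ids are not the cause. Note that drop_duplicate_ids keeps the first document for each id and discards the rest; if the duplicated ids point to genuinely different documents, giving them new unique ids is probably the better fix.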