qdrant / fastembed

Fast, Accurate, Lightweight Python library to make State of the Art Embedding
https://qdrant.github.io/fastembed/
Apache License 2.0

Qdrant giving relatively high scores when doing embeddings with `BAAI/bge-small-en-v1.5` #74

Closed evanhameed99 closed 8 months ago

evanhameed99 commented 9 months ago

Hi,

I am generating embeddings for a large number of abstracts, each 500 characters long, using fastembed with BAAI/bge-small-en-v1.5, and inserting them into Qdrant. I use the same setup to embed user questions for my LLM, and then use the resulting vector to find relevant abstracts in Qdrant.

The similarity scores between the embeddings come out high, which does not always make sense.

Consider the examples below.

Q: Do you know any info about Atomic habits book by James Clear?

Retrieved abstract from db: Do you ever know the book? Yeah, this is a very interesting book to read. As we told previously, reading is not kind of obligation activity to do when we have to obligate. Reading should be a habit, a good habit. By reading, you can open the new world and get the power from the world. Everything can be gained through the book. Well in brief, book is very powerful. As what we offer you right here, this natural healing is as one of reading book for you.

The score of this answer is 0.72.

Q: Who was the creator of dragon ball Xenoverse game?

Retrieved abstract from db: Abstract Computational algorithms can be described in many methods and implemented in many languages. Here we present an approach using storytelling methods of computer game design in modeling some finite-state machine algorithms and applications requiring user interaction. An open source software Twine is used for the task. Interactive nonlinear stories created with Twine are applications that can be executed in a web browser. Storytelling approach provides an easy-to-understand view on computational power.

The score of this record is 0.6188309.

I understand that there is some similarity between the questions above and their corresponding retrieved abstracts. It is fine that they are retrieved, but not with such scores. I would expect these abstracts to score lower than 0.7 or 0.6.

This behavior made me raise the score_threshold of the database retrieval to 0.75.

Is it only me who finds this odd, or is this the normal, expected behavior?

Previously, using sentence-transformers from Hugging Face gave lower scores for questions like the ones above. Using fastembed with BAAI/bge-base-en-v1.5 also gave high scores, as did sentence-transformers/all-minilm-l6-v2 from fastembed. Any ideas?

retrogtx commented 8 months ago

Longer or more frequent texts might have larger embedding vectors, which could result in higher dot products or cosine similarities. To reduce this effect, you could try normalizing the vector lengths or applying some weighting scheme to balance the popularity of the texts.
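
To make the normalization suggestion concrete, here is a minimal pure-Python sketch. The vectors and helper names are invented for illustration and are not fastembed or Qdrant APIs. It suggests that vector length affects the dot product but not cosine similarity, so normalizing would mainly matter if the collection is configured with a dot-product distance.

```python
import math

def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean length of a vector.
    return math.sqrt(dot(a, a))

def normalize(a):
    # Scale a vector to unit length.
    n = norm(a)
    return [x / n for x in a]

def cosine(a, b):
    # Cosine similarity: dot product divided by both lengths.
    return dot(a, b) / (norm(a) * norm(b))

# Toy vectors (illustrative values, not real embeddings).
query = [0.2, 0.1, 0.4]
doc = [0.3, 0.0, 0.5]
scaled_doc = [3 * x for x in doc]  # same direction, three times longer

print("dot:   ", dot(query, doc), dot(query, scaled_doc))
print("cosine:", cosine(query, doc), cosine(query, scaled_doc))
print("dot of unit vectors:", dot(normalize(query), normalize(scaled_doc)))
```

In this sketch the dot product triples when the document vector is scaled, while the cosine similarity stays the same and matches the dot product of the unit-length vectors.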

generall commented 8 months ago

In most cases, the absolute values of similarity scores have very limited usefulness. We don't recommend relying on them and suggest using relative values instead.
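
One possible reading of "relative values" is to compare each hit against the best hit for the same query rather than against a fixed cutoff. The sketch below assumes that interpretation. The function name `keep_relative`, the `max_gap` parameter, and the scores are hypothetical and are not part of Qdrant or fastembed.

```python
def keep_relative(hits, max_gap=0.05):
    """Keep hits whose score is within max_gap of the best hit.

    hits is a list of (doc_id, score) pairs, higher score = more similar.
    """
    if not hits:
        return []
    # Rank by score, best first.
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    best = ranked[0][1]
    # Filter on the gap to the top score instead of an absolute threshold.
    return [(doc_id, score) for doc_id, score in ranked if best - score <= max_gap]

# Illustrative scores only, not measured results.
hits = [("c", 0.72), ("a", 0.86), ("d", 0.62), ("b", 0.84)]
print(keep_relative(hits))
```

Note that this always keeps the top hit, so it does not by itself detect the case where nothing in the collection is relevant to the query.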

generall commented 8 months ago

It seems the problem is not fastembed-specific (but rather model-specific). I will move it to Discussions, if you don't mind.