Closed: sarfraznawaz2005 closed this 2 months ago
Hi @sarfraznawaz2005, I don't see anything wrong from the info above. I'd double check that you're passing the expected text to the LLMs for both documents and queries. If the text is longer, you may want to try different chunking strategies. Also, I'd remove any vector indexes while debugging so exact search is used.
Edit: You may also want to add ->where('llm', ...) to the query to ensure you're not comparing embeddings across different models (which would produce meaningless scores).
@ankane thanks for the reply.
I tried different chunk sizes; same result. Yes, the results returned by the Document::query() code below are passed to the LLM, and for queries, embeddings are generated via the LLM and passed to the same Document::query() code below:
Document::query()
->select(['id', 'content', 'llm', 'metadata'])
->selectRaw("$field <-> ? AS score", [$queryEmbeddings])
->orderByRaw('score ASC') // in L2, lower is better
->limit(5)
->get()
The actual query becomes:
select "id", "content", "llm", "metadata", embedding_768 <-> '[-0.042995043,0.020061782,-0.012362629,0.037271198,0.014026179,0.052065067, until 768]' AS score from "documents" order by score ASC limit 5
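For reference, pgvector's <-> operator computes Euclidean (L2) distance, so lower scores really do mean closer vectors. A quick illustration of what Postgres computes for each row (Python for illustration; tiny made-up vectors, not actual embeddings):

```python
import math

def l2_distance(a, b):
    """Euclidean distance, i.e. what pgvector's <-> operator computes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.1, 0.2, 0.3]
doc_close = [0.1, 0.2, 0.35]  # near the query vector -> small distance
doc_far = [0.9, -0.4, 0.0]    # far from the query vector -> large distance

print(l2_distance(query, doc_close))  # ≈ 0.05
print(l2_distance(query, doc_far))    # ≈ 1.04
```

The ORDER BY score ASC in the SQL above is doing exactly this comparison per row.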
In the above query I passed the gibberish text sdfsdfsdfsdfsdfdgrytytu567658yumhgj5674323rdfbfghgfhfg
and it still gave results as below:
For now, llm is always gemini, and since that's the only LLM currently in use, I don't think adding it to the where clause would help.
I also tried without indexing; the scores still don't make sense. I also verified that the correct embeddings are saved for each piece of text, by generating embeddings for different texts and comparing them with what is saved in the db.
Is there some other way to modify the query, or some other approach, so that I can get correct scoring or filter based on score?
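One thing worth noting: ORDER BY distance LIMIT 5 is a pure nearest-neighbor query, so it always returns the 5 closest rows no matter how far away they are; that is why gibberish queries still get results. Relevance has to be enforced with an explicit distance cutoff. A minimal sketch of the difference (Python for illustration; toy 2-d vectors, and the threshold value is illustrative, not tuned):

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

docs = {"doc1": [0.1, 0.2], "doc2": [0.3, 0.1], "doc3": [0.9, 0.8]}
query = [5.0, 5.0]  # stands in for a gibberish query: far from everything

# Nearest-neighbor search always ranks and returns rows...
ranked = sorted(docs, key=lambda k: l2_distance(query, docs[k]))
print(ranked)  # every doc appears, even though none is a real match

# ...so relevance must be enforced with an explicit cutoff on the distance.
THRESHOLD = 1.0  # illustrative; must be tuned per model and distance metric
matches = [k for k in ranked if l2_distance(query, docs[k]) <= THRESHOLD]
print(matches)  # [] -> nothing relevant, fall back to other search methods
```

In SQL terms this would mean adding a WHERE (or HAVING) condition on the computed distance, not just ordering by it.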
Here is the full code if that helps:
Add Docs to DB: https://github.com/sarfraznawaz2005/docchat/blob/main/src/Models/Document.php#L45
Fetch: https://github.com/sarfraznawaz2005/docchat/blob/main/src/Services/LLMUtilities.php#L103
The query will return the correct score based on the vectors and the distance function.
It looks like you're using the SEMANTIC_SIMILARITY task type instead of RETRIEVAL_DOCUMENT and RETRIEVAL_QUERY for Gemini, which could be the issue: https://ai.google.dev/gemini-api/tutorials/document_search
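For reference, the Gemini embedding API accepts a task-type setting, and documents and queries are meant to be embedded with different task types (see the linked docs for details). A sketch of the two request bodies (Python for illustration; the model name is an assumption and no request is actually sent here):

```python
# Sketch of the two embedContent request bodies, per the Gemini docs linked
# above. The model name is illustrative; nothing is sent over the network.
def embed_request(text: str, task_type: str) -> dict:
    return {
        "model": "models/text-embedding-004",     # assumed model name
        "content": {"parts": [{"text": text}]},
        "taskType": task_type,                    # the setting that differs
    }

# Document chunks stored in the table are embedded one way...
doc_req = embed_request("chunk of document text", "RETRIEVAL_DOCUMENT")
# ...and the user's search string the other way.
query_req = embed_request("user question", "RETRIEVAL_QUERY")

print(doc_req["taskType"], query_req["taskType"])
```

Using SEMANTIC_SIMILARITY for both sides can produce embeddings whose distances are poorly separated for retrieval, which matches the symptom described.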
Please use other resources for additional help (as this isn't an issue with pgvector).
@ankane thanks for the reply. It seems the issue is something else; I will figure it out. Thanks for the help, and thanks for the great package in the first place.
I am using L2 distance, but the issue I am running into is that even if I search for gibberish text, results are returned. Here is my migration:
Model:
Insert Code:
Fetch Code:
In order to avoid getting results for gibberish text, I tried adding the score field above so I could filter based on score. The issue then is that even for non-existing text, I see scores similar to those for matched records. Example:
Notice the scores for existing and non-existing texts.
I tried with cosine and other distance functions, but the scores don't make sense.
Probably I am doing something wrong. The reason I want to do this is to avoid sending a query to the LLM when records do not fall within a certain threshold, and so that I get an opportunity to search via other methods if semantic search fails; I should be able to know that beforehand. Right now, I don't get a chance to run fallback search methods, because the fetch code above always returns records, even for text that does not exist in any of the documents. If I could get the scoring right, it would be helpful.
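The gating described above can be sketched like this (Python for illustration; semantic_search and fallback_search are hypothetical callables standing in for the fetch code, and the threshold value is illustrative, not tuned):

```python
def answer(query_embedding, semantic_search, fallback_search, threshold=0.8):
    """Only proceed to the LLM when the best score clears a distance cutoff.

    semantic_search / fallback_search are hypothetical stand-ins for the
    actual fetch code; scores are distances, so lower means better.
    """
    results = semantic_search(query_embedding)  # [(score, content), ...]
    good = [(s, c) for s, c in results if s <= threshold]
    if good:
        return ("llm", good)  # relevant chunks found; send them to the LLM
    # Nothing relevant: skip the LLM and try other search methods instead.
    return ("fallback", fallback_search(query_embedding))

# Tiny usage example with stub search functions:
hits = lambda q: [(0.2, "matched chunk")]       # a close match exists
misses = lambda q: [(1.7, "unrelated chunk")]   # only distant rows returned
fb = lambda q: ["keyword-search result"]

print(answer(None, hits, fb))    # ("llm", [(0.2, "matched chunk")])
print(answer(None, misses, fb))  # ("fallback", ["keyword-search result"])
```

The key design point is that the threshold check happens before any LLM call, so the fallback path always gets a chance to run.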
Thanks for the help.