marella / chatdocs

Chat with your documents offline using AI.
MIT License

`score_threshold` in db.as_retriever doesn't seem to be enforced #80

Closed (drvenabili closed this issue 11 months ago)

drvenabili commented 11 months ago

Hi,

I'm trying to prevent the model from returning documents that are not relevant. To do so, I wanted to change the default search_type param in db.as_retriever to "similarity_score_threshold". According to the langchain docs, combined with search_kwargs={"score_threshold": 0.000000005}, it should limit returned documents to only those that have a similarity score of at least 0.000000005: https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore#similarity-score-threshold-retrieval . I deliberately chose a tiny score to make sure no results are returned.

This does not seem to be the case, as it returns k documents.

I'm not sure how to go about fixing this. Any ideas?

Below is some more information showing that the retriever does indeed get the correct config, and that it's correctly passed on to RetrievalQA:

in chatdocs.yml:

retriever:
  search_type: similarity_score_threshold
  search_kwargs:
    k: 3
    score_threshold: 0.000000005

in chains.py:

retriever = db.as_retriever(**config["retriever"])
pprint(retriever)
VectorStoreRetriever(
        tags=['Chroma', 'HuggingFaceInstructEmbeddings'], 
        metadata=None, 
        vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7f7362e66bf0>, 
        search_type='similarity_score_threshold', 
        search_kwargs={'k': 3, 'score_threshold': 5e-09}
)
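
For readers following along, here is a minimal sketch of how the `retriever` mapping from chatdocs.yml ends up as keyword arguments. `as_retriever_stub` is a hypothetical stand-in written for this illustration, not LangChain's `as_retriever`; it only records what it receives. The config dict is assumed to mirror the YAML above.

```python
def as_retriever_stub(search_type="similarity", search_kwargs=None):
    """Hypothetical stand-in: records the arguments it is called with."""
    return {"search_type": search_type, "search_kwargs": search_kwargs or {}}


# Assumed to mirror the `retriever` section of chatdocs.yml shown above.
config = {
    "retriever": {
        "search_type": "similarity_score_threshold",
        "search_kwargs": {"k": 3, "score_threshold": 0.000000005},
    }
}

# ** unpacks the mapping, so each top-level key becomes a keyword argument.
retriever = as_retriever_stub(**config["retriever"])
print(retriever)
```

Python prints the float 0.000000005 as 5e-09, which is why the pprint output shows that form; the value itself is unchanged.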

in chat.py:

qa = get_retrieval_qa(config, callback=print_answer)
pprint(qa)
RetrievalQA(
        (...) 
        input_key='query', output_key='result', 
        return_source_documents=True, 
        retriever=VectorStoreRetriever(
                tags=['Chroma', 'HuggingFaceInstructEmbeddings'], 
                metadata=None, 
                vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7f7362e66bf0>, 
                search_type='similarity_score_threshold', 
                search_kwargs={'k': 3, 'score_threshold': 5e-09}))

Despite this, and despite a garbage input, the chat still returns k documents (below, k == 3). This also happens when I don't specify k, as it defaults to 4. Am I missing something? Thanks

Q: fkuefkuefhef
A: I don't know the answer to that question.
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ examples/documents/state_of_the_union.txt                                                                                                       │
│                                                                                                                                                 │
│ So what are we waiting for? Let’s get this done. And while you’re at it, confirm my nominees to the Federal Reserve, which plays a critical     │
│ role in fighting inflation.                                                                                                                     │
│                                                                                                                                                 │
│ My plan will not only lower costs to give families a fair shot, it will lower the deficit.                                                      │
│                                                                                                                                                 │
│ The previous Administration not only ballooned the deficit with tax cuts for the very wealthy and corporations, it undermined the watchdogs     │
│ whose job was to keep pandemic relief funds from being wasted.                                                                                  │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ examples/documents/state_of_the_union.txt                                                                                                       │
│                                                                                                                                                 │
│ My administration is providing assistance with job training and housing, and now helping lower-income veterans get VA care debt-free.           │
│                                                                                                                                                 │
│ Our troops in Iraq and Afghanistan faced many dangers.                                                                                          │
│                                                                                                                                                 │
│ One was stationed at bases and breathing in toxic smoke from “burn pits” that incinerated wastes of war—medical and hazard material, jet fuel,  │
│ and more.                                                                                                                                       │
│                                                                                                                                                 │
│ When they came home, many of the world’s fittest and best trained warriors were never the same.                                                 │
│                                                                                                                                                 │
│ Headaches. Numbness. Dizziness.                                                                                                                 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ examples/documents/state_of_the_union.txt                                                                                                       │
│                                                                                                                                                 │
│ Fourth, we will continue vaccinating the world.                                                                                                 │
│                                                                                                                                                 │
│ We’ve sent 475 Million vaccine doses to 112 countries, more than any other nation.                                                              │
│                                                                                                                                                 │
│ And we won’t stop.                                                                                                                              │
│                                                                                                                                                 │
│ We have lost so much to COVID-19. Time with one another. And worst of all, so much loss of life.                                                │
│                                                                                                                                                 │
│ Let’s use this moment to reset. Let’s stop looking at COVID-19 as a partisan dividing line and see it for what it is: A God-awful disease.      │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
drvenabili commented 11 months ago

There's no smart way of spinning this: I misunderstood the thresholding. The threshold value should be high (so that it does not return documents BELOW said value), not low.
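
To make the point concrete, here is a small sketch of threshold filtering as I now understand it. It is plain Python, not LangChain's code: `filter_by_score` and the scores are made up for illustration, and I am assuming relevance scores where higher means more similar.

```python
def filter_by_score(scored_docs, score_threshold, k=4):
    """Take the k best-scoring documents, then keep only those whose
    score is at least score_threshold (hypothetical helper)."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    top_k = ranked[:k]
    return [(doc, score) for doc, score in top_k if score >= score_threshold]


# Made-up relevance scores for a garbage query: all low, but all above 5e-09.
scored = [("chunk A", 0.31), ("chunk B", 0.27), ("chunk C", 0.22), ("chunk D", 0.05)]

low = filter_by_score(scored, score_threshold=5e-09, k=3)
high = filter_by_score(scored, score_threshold=0.8, k=3)

print(len(low))   # every top-k document clears a tiny threshold
print(len(high))  # a high threshold filters all of them out
```

With a tiny threshold nothing is filtered, so k documents come back; only a high threshold can drop weak matches.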

I'm closing this non-issue...