snexus / llm-search

Querying local documents, powered by LLM
MIT License

Can we return source document with LLM response? #94

Closed mohammad-yousuf closed 7 months ago

mohammad-yousuf commented 7 months ago

Hi,

Is it possible to return specific source documents with LLM response? Is it possible to change re-rank model to bge-reranker-large?

snexus commented 7 months ago

Hi,

The LLM response should contain a reference to the sources on a best-effort basis, but it is very hard to enforce (it depends on the LLM; ChatGPT, for example, is more consistent in that regard).

Is it possible to change re-rank model to bge-reranker-large?

The BGE type is hard-coded at the moment to the base model, but other variations can be supported. Keep in mind that bge-reranker-large is a much larger model and will severely impact retrieval speed. From empirical evaluations, it doesn't provide much advantage over the smaller models but sacrifices a lot of speed.

If you want to play around, it can be changed here: https://github.com/snexus/llm-search/blob/fc69a69f504459ff64d59fb85696b46d640611e7/src/llmsearch/ranking.py#L35
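Roughly, the change would look something like this (a minimal sketch assuming a sentence-transformers CrossEncoder-style reranker; the actual class and attribute names in ranking.py may differ):

```python
# Minimal sketch of a cross-encoder reranker swapped to bge-reranker-large.
# Assumes the sentence-transformers CrossEncoder API; the actual code in
# llmsearch/ranking.py may wrap the model differently.
from sentence_transformers import CrossEncoder

# "BAAI/bge-reranker-large" is the larger BGE reranker discussed above;
# expect noticeably slower scoring than the base variant.
reranker = CrossEncoder("BAAI/bge-reranker-large", max_length=512)

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Score (query, chunk) pairs and return the top_k chunks by relevance."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```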

mohammad-yousuf commented 7 months ago

Thank you so much @snexus. Do you think I should change the models? The system cannot find some of the information at all.

I am using:

My data is large PDF files that contain images, charts, figures, and tables alongside text.

snexus commented 7 months ago

I don't think the problem is in the embedding model or the reranker.

PDFs are tricky, and results depend on parsing quality. I am not aware of open-source solutions that can do a quality parse of PDFs with figures and tables. This package extracts and parses the text from the PDF, and if your queries are based on that text, it should find it without a problem. If you expect to retrieve information from tables and graphs within the PDF, I'm afraid it won't work very well.

I am using this package to query a large collection of technical PDFs, and it works OK, but nothing compared to, say, querying a collection of markdown files (which are much easier to parse).

To try and troubleshoot:

1) Can you provide more information about the nature of the PDFs and what type of questions you are asking?
2) In the output, check whether the context provided to the LLM (after the retrieval and reranker steps) is relevant to your query and whether you can infer the answer from it. If yes, the problem is in the LLM: it can't synthesize the answer properly from the provided context, in which case you should try a different LLM. A quick sanity check is sketched below.
3) Make sure you have HyDE and Multiquery turned off.
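For step 2, a very rough sanity check might look like this (the function and the example chunks are purely illustrative, not part of llm-search's API):

```python
# Sketch of the sanity check from step 2: given the context chunks that were
# passed to the LLM (copied from the tool's debug output), verify that the
# facts you expect actually appear in them. Names and data are illustrative.
def context_contains_fact(context_chunks: list[str], expected_snippets: list[str]) -> bool:
    """Return True if every expected snippet shows up somewhere in the retrieved context."""
    joined = "\n".join(chunk.lower() for chunk in context_chunks)
    return all(snippet.lower() in joined for snippet in expected_snippets)

chunks = [
    "Annex II sets the blending mandate for aviation fuel at 5% from 2030.",
    "Member states shall report compliance annually.",
]
# True -> the context is fine, so suspect the LLM rather than retrieval.
print(context_contains_fact(chunks, ["5%", "2030"]))
```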

mohammad-yousuf commented 7 months ago

My dataset of PDFs consists of regulations from different countries. They are very large PDFs (which I can't even go through entirely), but I have seen that they contain data in pie charts and the like, and specifically numbers, which are important.

Sample questions:

Now, after you mentioned it, I went through the context provided after the reranker step. You are 100% right. The embedding model and re-ranker are doing a great job, and the context does contain the exact information that is necessary. But the output doesn't. So the problem must be the LLM.

Man, you are great.

If you can think of any additional steps for me (considering my data), I would be grateful.

snexus commented 7 months ago

In that case, there are a few things you can try:

After you sort out the model (by evaluating it on simpler questions), you can try to play with HyDE or Multiquery, which might assist in answering more generic questions, like "How many regulations deal with aviation fuel produced from biomass?"
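For reference, the HyDE idea is roughly the following (a sketch of the general technique, not llm-search's implementation; `generate`, `embed`, and `retrieve_by_embedding` are placeholders for whatever LLM, embedding model, and vector store you use):

```python
# Rough sketch of HyDE: embed a hypothetical answer instead of the raw
# question, so generic questions land closer to answer-like passages.
from typing import Callable, Sequence

def hyde_retrieve(
    question: str,
    generate: Callable[[str], str],             # LLM call: prompt -> text
    embed: Callable[[str], Sequence[float]],    # embedding model: text -> vector
    retrieve_by_embedding: Callable[[Sequence[float]], list[str]],  # vector store lookup
) -> list[str]:
    """Generate a hypothetical answer, embed it, and retrieve similar chunks."""
    hypothetical = generate(f"Write a short passage that answers: {question}")
    return retrieve_by_embedding(embed(hypothetical))
```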

A question like "How is report X related to report Y?" is too generic for RAG in my opinion - for that, the system would need a summary of both reports in order to compare them. It might be able to retrieve context, though, when report X is mentioned in report Y or vice versa.

mohammad-yousuf commented 7 months ago

@snexus I am using:

It is a restriction for me to use open-source models. Should I try other Mistral variants which are not instruct-tuned - maybe that would resolve the issue? I went for Mistral and not Llama 2 because of the longer context window.

I have already tried HyDE and Multiquery; sometimes the results are great, other times they are not as good as when they are turned off.

snexus commented 7 months ago

Instruct is the right variant for RAG. Are you able to try other models (without an increased context window) first and check whether that resolves the issue?

Here is a nice comparison from a few months ago - some of them even have an increased context window: https://www.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_comparisontest_2x_34b_yi_dolphin_nous/

You should be able to load any of these models with your hardware...
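If it helps, swapping in one of those models via llama-cpp-python would look roughly like this (the model path and parameters are placeholders; llm-search's own config may expose these options differently):

```python
# Hedged sketch of loading a different quantized instruct model with
# llama-cpp-python; the path below is a placeholder, not a real file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/some-instruct-model.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # standard context window, no long-context tricks
    n_gpu_layers=-1,   # offload all layers to GPU if VRAM allows
)

out = llm("Summarize the blending mandate for aviation fuel.", max_tokens=256)
print(out["choices"][0]["text"])
```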

mohammad-yousuf commented 7 months ago

I will surely look into it and change my LLM. Thanks a lot @snexus.

Edit: I converted PDFs to HTML and I feel like it works better.
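A conversion along those lines could be done with pdfminer.six, for example (just one option; the comment doesn't specify which tool was actually used):

```python
# Sketch of a PDF -> HTML conversion like the one mentioned above, using
# pdfminer.six. Keeping the HTML output can preserve more layout cues for
# downstream parsing than plain text extraction.
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

def pdf_to_html(pdf_path: str, html_path: str) -> None:
    """Extract a PDF's text layer as HTML."""
    with open(pdf_path, "rb") as src, open(html_path, "wb") as dst:
        extract_text_to_fp(src, dst, output_type="html", laparams=LAParams())

pdf_to_html("regulation.pdf", "regulation.html")  # example file names
```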