swissspidy opened 5 months ago
Hi @swissspidy,
I really appreciate you taking the time to look more closely at my project!
The thing is that for small PDFs, I would agree the whole vectorization is not necessary; it would be possible to just send the entire content. But you would lose the link to the sources. With RAG I determine which sources are relevant for answering the question, whereas if I simply send the entire content, it is not clear which passages the answer is based on.
Also, as soon as we work with bigger files (I am working on that), we will quickly exceed the 8k context window of Gemma 2B, so we need to boil the document down to the relevant parts.
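In pseudo-TypeScript, the idea looks roughly like this (a simplified sketch, not the actual implementation; `Chunk`, `selectContext` and the token estimate are illustrative):

```ts
// Sketch: rank pre-embedded chunks by cosine similarity to the question and
// keep only as many as fit next to the prompt in the model's context window,
// so every passage sent to the LLM stays traceable to its source page.
interface Chunk {
  text: string;
  page: number; // source reference kept alongside the text
  embedding: number[];
}

const cosine = (a: number[], b: number[]): number => {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

function selectContext(question: number[], chunks: Chunk[], tokenBudget: number): Chunk[] {
  const ranked = [...chunks].sort(
    (a, b) => cosine(question, b.embedding) - cosine(question, a.embedding)
  );
  const picked: Chunk[] = [];
  let used = 0;
  for (const chunk of ranked) {
    const cost = Math.ceil(chunk.text.length / 4); // very rough token estimate
    if (used + cost > tokenBudget) break;
    picked.push(chunk);
    used += cost;
  }
  return picked;
}
```

Because each selected chunk carries its page, the answer can still point back to the passages it was based on.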
One limitation I found is that, the way it is set up, it works best with long paragraphs of continuous text. That's what all-MiniLM-L6-v2, the model I use for similarity search, was trained on. Especially with tables, it quickly loses the right context. In your case with the menu (not much text to process, not suitable for sentence-transformer similarity search), I would agree that just sending the whole menu might provide better results.
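For reference, this is roughly how such a sentence embedding is produced (assuming Transformers.js, i.e. `@xenova/transformers`; a simplified sketch, not the exact code from the repo):

```ts
import { pipeline } from '@xenova/transformers';

// Load all-MiniLM-L6-v2 as a feature-extraction pipeline (runs in the browser).
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Mean pooling + normalization turn the token vectors into one sentence embedding.
const output = await extractor('What is SpiderMonkey?', {
  pooling: 'mean',
  normalize: true,
});
const embedding = Array.from(output.data as Float32Array); // 384 dimensions
```

The model compares whole sentences, which is why long flowing text works well and table cells or menu items don't give it much to go on.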
I was thinking more of long documents such as terms and conditions or contracts, which have long flowing texts and clear chapters. Like the "Twitter Terms of Service" I use in my example here: https://www.linkedin.com/feed/update/urn:li:activity:7203439376545054720/
Yeah, I definitely get the reason behind it, and there is a need to stay within the limits of the system.
I think my issue is that, while it works for some prompts, it is not suitable for others: a) because of the loss of information, and b) because my question for Gemma does not necessarily make for a good query for the vector DB.
Even for a text-heavy document such as the one from https://mozilla.github.io/pdf.js/web/viewer.html, I can't ask questions such as "What is the title of this document?" or "How many words are in this document?" (just some examples). On the other hand, a query like "What is SpiderMonkey?" works better, because the DB search already returns decent results thanks to a similar sentence existing in the document.
I am actually surprised how well it works with your example document, at least for questions that are dealt with in the content of the document. It has a few problems with meta questions about the document. In my opinion, "Ask my PDF" has two main problems:

1. Structured content such as tables, where the similarity search does not find the right context.
2. Meta questions about the document itself, for which no similar passage exists to retrieve.
Unfortunately, I don't yet have a good idea for the first issue in particular. To be honest, I've seen this problem coming since I started the project :/
After some second and third thoughts, I might need to change to a different approach. Right now, every question is handled as if RAG were the best solution. But as in your example, quite often it's not. So maybe I should try a "function calling" approach first, where I analyse the question and try to figure out how to proceed from there.
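Something along these lines (a rough sketch; the strategy names and the `ask()` helper are placeholders, not an existing API):

```ts
// Route the question before retrieval instead of assuming RAG every time.
type Strategy = 'RAG' | 'FULL_TEXT' | 'METADATA';

async function pickStrategy(
  question: string,
  ask: (prompt: string) => Promise<string> // any LLM call, e.g. Gemma
): Promise<Strategy> {
  const prompt =
    `Classify the question. Answer with exactly one word.\n` +
    `RAG: answerable from specific passages of the document.\n` +
    `FULL_TEXT: needs the whole document (summaries, counts).\n` +
    `METADATA: about the document itself (title, author, length).\n\n` +
    `Question: ${question}`;
  const answer = (await ask(prompt)).trim().toUpperCase();
  if (answer.includes('FULL_TEXT')) return 'FULL_TEXT';
  if (answer.includes('METADATA')) return 'METADATA';
  return 'RAG'; // default to retrieval
}
```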
I guess that could be a better approach. But I still think I need to solve the structured content problem first 😕
Cool, it sounds like we're on the same page :)
Heya,
I thought this would be an interesting project to leverage the new built-in AI capabilities in Chrome, so I forked the repo and started tinkering with it.
I had a few observations/questions about the RAG part.
First, it is sending an extra curly brace here:
https://github.com/nico-martin/ask-my-pdf/blob/cf3e36d8ccad765509e88acac801d6b35dd308e3/src/store/ragContext/RagContextProvider.tsx#L99
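For illustration, this is the kind of slip I mean (a made-up example, not the actual code at that line):

```ts
// Hypothetical illustration: a leftover "}" in a template literal gets sent
// to the model with every prompt.
const context = 'relevant passages from the PDF';
const buggy = `Context: ${context}}`; // note the extra closing brace
const fixed = `Context: ${context}`;  // what was intended
```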
Second, here it is potentially adding the same lines multiple times:
https://github.com/nico-martin/ask-my-pdf/blob/cf3e36d8ccad765509e88acac801d6b35dd308e3/src/store/ragContext/RagContextProvider.tsx#L71-L94
So this ends up being bad input for the LLM as it's just the same content repeated over and over again.
With some `console.log()` here and there this was easy to spot. Here's a sketch of the kind of dedupe I used to fix the duplication (identifiers are illustrative, not the repo's):
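```ts
// Assuming the loop collects the lines surrounding each search hit: track
// line indices in a Set so overlapping windows don't add the same line twice.
function collectContextLines(hits: number[], lines: string[], windowSize = 2): string[] {
  const seen = new Set<number>();
  for (const hit of hits) {
    const start = Math.max(0, hit - windowSize);
    const end = Math.min(lines.length - 1, hit + windowSize);
    for (let i = start; i <= end; i++) {
      seen.add(i);
    }
  }
  // Sort so the surviving lines keep their original document order.
  return [...seen].sort((a, b) => a - b).map((i) => lines[i]);
}
```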
Third, I am struggling to see the need for the whole vector DB here. In my testing, it leads to poor results due to information loss.
Example: