nico-martin / ask-my-pdf

A Webapp that uses Retrieval Augmented Generation (RAG) and Large Language Models to interact with a PDF directly in the browser.
https://pdf.nico.dev
MIT License

Bugs / Clarifications #1

Open swissspidy opened 5 months ago

swissspidy commented 5 months ago

Heya,

So I thought this would be an interesting project to leverage the new built-in AI capabilities in Chrome, so I forked the repo and started tinkering with it.

I had a few observations/questions about the RAG part.

First, it is sending an extra curly brace here:

https://github.com/nico-martin/ask-my-pdf/blob/cf3e36d8ccad765509e88acac801d6b35dd308e3/src/store/ragContext/RagContextProvider.tsx#L99

Second, here it is potentially adding the same lines multiple times:

https://github.com/nico-martin/ask-my-pdf/blob/cf3e36d8ccad765509e88acac801d6b35dd308e3/src/store/ragContext/RagContextProvider.tsx#L71-L94

So this ends up being bad input for the LLM as it's just the same content repeated over and over again.

With some console.log() here and there this was easy to spot.

Here's what I did to fix duplication:

    const foundEntries: Array<string> = [];
    results.forEach((result) => {
      let entry = '';
      // Look at the matched line plus the three lines before and after it.
      [...Array(7).keys()].forEach((i) => {
        const line = entries.find(
          (e) =>
            e.metadata.allLinesNumber ===
            result[0].metadata.allLinesNumber + (i - 3)
        );
        if (line) {
          // Skip lines that are already part of an earlier entry.
          if (
            !activeLines.includes(line.metadata.allLinesNumber) &&
            !fuzzyLines.includes(line.metadata.allLinesNumber)
          ) {
            entry += `${line.str} `;
          }
          if (i - 3 === 0) {
            activeLines.push(line.metadata.allLinesNumber);
          } else {
            fuzzyLines.push(line.metadata.allLinesNumber);
          }
        }
      });
      if (entry) {
        foundEntries.push(entry);
      }
    });
    setActiveLines({ exact: activeLines, fuzzy: fuzzyLines });

    let prompt = `These are parts of the ${pdfTitle}:\n\n`;

    // TODO: Test with allEntries or foundEntries
    foundEntries.forEach((result) => {
      prompt += `"${result}"\n\n`;
    });
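
As a sanity check, the windowing-plus-dedup idea can be tested in isolation. This is only a sketch; the names (`Line`, `collectWindows`) are mine, not the repo's:

```typescript
// Standalone sketch of the deduplication above: collect a ±3 line window
// around each hit, but emit every line number at most once across all hits.
interface Line {
  allLinesNumber: number;
  str: string;
}

function collectWindows(lines: Line[], hits: number[], radius = 3): string[] {
  const seen = new Set<number>();
  const entries: string[] = [];
  for (const hit of hits) {
    let entry = '';
    for (let n = hit - radius; n <= hit + radius; n++) {
      const line = lines.find((l) => l.allLinesNumber === n);
      if (line && !seen.has(n)) {
        seen.add(n);
        entry += `${line.str} `;
      }
    }
    if (entry) entries.push(entry.trim());
  }
  return entries;
}
```

With two overlapping hits (e.g. lines 5 and 6 of a ten-line document), the second entry only contains the lines the first one didn't already cover.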

Third, I am struggling to see the need for the whole vector DB here. In my testing, it leads to poor results due to information loss.

Example:

  1. I upload a PDF of a restaurant's menu: a number of dishes with some descriptions and a price.
  2. I ask "What is the most expensive item on the menu?"
  3. The vector DB is searched for this query and returns maybe a third of the PDF text.
  4. The LLM now only gets that third of the PDF content as context for the prompt.
  5. You won't get a correct result due to the information loss.
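
To make the information-loss point concrete, here's a toy sketch (made-up data and similarity scores) of why top-k retrieval can't answer aggregate questions:

```typescript
// Why top-k retrieval fails for questions like "most expensive item":
// the answer depends on *every* row, but the retriever returns a subset.
// Data and similarity scores are invented purely for illustration.
interface MenuItem { name: string; price: number; score: number }

const menu: MenuItem[] = [
  { name: 'Soup', price: 8, score: 0.9 },
  { name: 'Pasta', price: 14, score: 0.8 },
  { name: 'Steak', price: 32, score: 0.1 }, // low similarity to the query
];

// Retrieval keeps only the top 2 chunks by similarity...
const retrieved = [...menu].sort((a, b) => b.score - a.score).slice(0, 2);
const retrievedMax = Math.max(...retrieved.map((m) => m.price)); // 14

// ...but the true answer needs the full document.
const trueMax = Math.max(...menu.map((m) => m.price)); // 32
```
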
nico-martin commented 5 months ago

Hi @swissspidy,

I really appreciate you taking the time to take a closer look at my project!

  1. Yep, that's a bug. Will fix it.
  2. Hmm.. you're right. I will try that.
  3. Ok, that will be a longer answer :)

The thing is that for small PDFs I would agree that the whole vectorization is not necessary. It would be possible to just send the whole content. But you would lose the link to the sources. With RAG I determine which sources are relevant for answering the questions. If I simply send the entire content, it is not clear which passages the answer is based on.

Also as soon as we work with bigger files (I am working on that) we will quickly exceed the 8k context window of the Gemma 2B. So we need to boil it down to the relevant parts.
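
The "boil it down" step could look roughly like this greedy, budget-based selection. Purely a sketch: `Chunk`, `selectChunks`, and the 4-characters-per-token estimate are my assumptions, not the project's code or Gemma's actual tokenizer:

```typescript
// Greedily add the highest-scoring chunks until an approximate token
// budget (e.g. for an 8k-context model) is exhausted.
interface Chunk { text: string; score: number }

function selectChunks(chunks: Chunk[], maxTokens: number): Chunk[] {
  // Rough rule of thumb: ~4 characters per token.
  const estimateTokens = (s: string) => Math.ceil(s.length / 4);
  const picked: Chunk[] = [];
  let used = 0;
  for (const chunk of [...chunks].sort((a, b) => b.score - a.score)) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > maxTokens) continue;
    picked.push(chunk);
    used += cost;
  }
  return picked;
}
```
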

One limitation I found is that, the way it is set up, it works best with long paragraphs of continuous text. That's what all-MiniLM-L6-v2, the model I use for similarity search, was trained for. Especially with tables, it quickly loses the right context. In your case with the menu (not much text to process, not suitable for sentence-transformer similarity search), I would agree that just sending the whole menu might provide better results.
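
For reference, the similarity search ultimately reduces to cosine similarity between embedding vectors. Toy 3-d vectors below; real all-MiniLM-L6-v2 embeddings are 384-dimensional:

```typescript
// Cosine similarity: dot product of two vectors divided by the product
// of their magnitudes. 1 = same direction, 0 = orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}
```
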

I was thinking more of long documents such as terms and conditions or contracts, which have long flowing texts and clear chapters. Like the "twitter terms of service" I use in my example here: https://www.linkedin.com/feed/update/urn:li:activity:7203439376545054720/

swissspidy commented 5 months ago

Yeah I definitely get the reason behind it and there is a need to stay within the limits of the system.

I think my issue is that, while it works for some prompts, it is not suitable for others: a) because of the loss of information, and b) because my question for Gemma does not necessarily make a good query for the vector DB.

Even for a text-heavy document such as the one from https://mozilla.github.io/pdf.js/web/viewer.html, I can't ask questions such as "What is the title of this document?" or "How many words are in this document?" (just some examples). On the other hand, a query like "What is SpiderMonkey?" works better because the DB search already gives good results, since a similar sentence exists in the document.

nico-martin commented 5 months ago

I am actually surprised how well it works with your example document, at least for questions that are dealt with in the content of the document. It has a few problems with meta questions about the document. In my opinion, "Ask my PDF" has two main problems:

  1. The parsing of the PDF is not yet good enough. Parsing line by line means we lose the document structure (coherent paragraphs, headings, etc.), and since sentences are split when they run over two lines, the search does not work optimally either.
  2. The generated prompt could provide more meta information about the document, so that questions about the document itself can be answered as well.
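
For problem 1, a first (admittedly naive) step could be merging consecutive lines into paragraphs before embedding. Illustrative only; real PDFs would need layout information (positions, font sizes) to do this reliably:

```typescript
// Merge consecutive non-empty lines into paragraphs, splitting on blank
// lines, so the embedder sees whole sentences instead of line fragments.
function linesToParagraphs(lines: string[]): string[] {
  const paragraphs: string[] = [];
  let current = '';
  for (const line of lines) {
    if (line.trim() === '') {
      if (current) paragraphs.push(current.trim());
      current = '';
    } else {
      current += `${line.trim()} `;
    }
  }
  if (current) paragraphs.push(current.trim());
  return paragraphs;
}
```
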

Unfortunately, I don't yet have a good idea for the first issue in particular. To be honest, I've seen this problem coming since I started the project :/

nico-martin commented 5 months ago

After some second and third thoughts, I might need to switch to a different approach. Right now, every question is handled as if RAG were the best solution. But as your example shows, quite often it's not. So maybe I should try a "function calling" approach first, where I analyse the question and try to figure out how to proceed from there.

I guess that could be a better approach. But I still think I need to solve the structured-content problem first 😕
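
The routing idea could be sketched like this. The keyword heuristic is a placeholder of mine; a real version would probably ask the model itself to classify the question:

```typescript
// Classify the question first, then decide between RAG and sending the
// whole document. Hints and thresholds are invented for illustration.
type Strategy = 'rag' | 'full-document';

function chooseStrategy(
  question: string,
  docTokens: number,
  contextBudget: number
): Strategy {
  const aggregateHints = ['most', 'least', 'how many', 'count', 'total', 'title'];
  const needsWholeDoc = aggregateHints.some((h) =>
    question.toLowerCase().includes(h)
  );
  // Aggregate/meta questions need the whole document, if it fits the context.
  if (needsWholeDoc && docTokens <= contextBudget) return 'full-document';
  return 'rag';
}
```
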

swissspidy commented 5 months ago

Cool, it sounds like we're on the same page :)