ngxson / wllama

WebAssembly binding for llama.cpp - Enabling in-browser LLM inference
https://huggingface.co/spaces/ngxson/wllama
MIT License

How would you implement RAG / Document chat? #36

Closed flatsiedatsie closed 6 months ago

flatsiedatsie commented 6 months ago

In your readme you mention:

Maybe doing a full RAG-in-browser example using tinyllama?

I've been looking into a way to allow users to 'chat with their documents', a popular concept. Specifically, I was looking at 'Fully local PDF chatbot'. It seems... complicated.

So I was wondering: if one wanted to implement this feature using Wllama, what are the 'components' of such a solution?

Would it be something like...

What would the steps actually be?

ngxson commented 6 months ago

A classic RAG system consists of a vector database plus a generative model. With wllama, this can be achieved by embedding your document chunks to build the vector store, then, for each user question, retrieving the most similar chunks and passing them to the generative model along with the question.
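
Roughly, the flow could look like the sketch below. Treat it as a rough sketch, not a drop-in implementation: it assumes a wllama instance with an embedding-capable model already loaded, uses wllama's createEmbedding / createCompletion methods (check the docs for the exact options), and cosineSim plus the chunks / question variables are placeholders you would provide yourself.

// Rough sketch of RAG with wllama (model loading omitted; option names may differ).

// 1. Embed every document chunk once and keep the vectors in memory
//    (or persist them, e.g. in IndexedDB, to act as a tiny vector database).
const index = [];
for (const chunk of chunks) {
  index.push({ chunk, vector: await wllama.createEmbedding(chunk) });
}

// 2. Cosine similarity between two vectors.
function cosineSim(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// 3. Embed the question, rank the chunks, keep the top few.
const queryVector = await wllama.createEmbedding(question);
const topChunks = index
  .map(({ chunk, vector }) => ({ chunk, score: cosineSim(queryVector, vector) }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 3)
  .map(({ chunk }) => chunk);

// 4. Feed the retrieved chunks plus the question to the generative model.
const prompt = `Context:\n${topChunks.join('\n---\n')}\n\nQuestion: ${question}\nAnswer:`;
const answer = await wllama.createCompletion(prompt, { nPredict: 256 });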

Another idea, which is only possible if your document is short and predefined, is to construct a session once and reuse it later (via sessionSave and sessionLoad). This is useful in my case, for example: if the chatbot is purely there to introduce a specific website, we don't even need a vector database or embeddings at all. The downside is that this isn't practical for other usages.
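
For completeness, the session-reuse idea would be shaped roughly like this. It is an assumption-heavy sketch: sessionSave / sessionLoad are real wllama methods, but the exact signatures are not checked here, and the storage helpers are hypothetical.

// Assumed shapes only: sessionSave() returning serializable data and
// sessionLoad() accepting it back; check the wllama docs for the real API.

// One-time step: evaluate the fixed prompt (e.g. the website intro text)
// so its KV cache is built, then persist the session.
await wllama.createCompletion(fixedIntroPrompt, { nPredict: 1 });
const session = await wllama.sessionSave();              // assumed return value
await saveSessionToIndexedDB('intro-session', session);  // hypothetical helper

// Later visits: restore the session instead of re-processing the prompt.
const saved = await loadSessionFromIndexedDB('intro-session'); // hypothetical helper
await wllama.sessionLoad(saved);                          // assumed argument
const reply = await wllama.createCompletion(userQuestion, { nPredict: 200 });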

felladrin commented 6 months ago

For a small embedding model good for this case, I can recommend this one: sentence-transformers/multi-qa-MiniLM-L6-cos-v1 (GGUF)

flatsiedatsie commented 6 months ago

Getting there...

[Screenshot: 2024-05-21 at 14:01:51]

Currently using Transformers.js because I could find easy-to-copy examples:

import { pipeline } from '@xenova/transformers';

// Load the feature-extraction pipeline and report download progress to the main thread.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
    quantized: false,
    progress_callback: (data) => {
        self.postMessage({
            type: 'embedding_progress',
            data
        });
    }
});

// Mean-pool and normalize to get one embedding vector per input text.
const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });

I've also seen this model mentioned for embedding: nomic-ai/nomic-embed-text-v1? But for now... it works.

Next: get an LLM to summarize the chunks.
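
For reference, a rough sketch of how that next step could be wired up, reusing extractor, texts and embeddings from the code above. The dot product only works as cosine similarity because normalize: true was used, and the wllama.createCompletion call and its options are assumptions rather than copied from a working setup.

// Convert the Transformers.js tensors to plain arrays.
const chunkVectors = embeddings.tolist();
const [queryVector] = (await extractor(question, { pooling: 'mean', normalize: true })).tolist();

// The vectors are already normalized, so the dot product equals cosine similarity.
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

// Rank chunks by similarity to the question and keep the top few.
const topChunks = chunkVectors
  .map((vector, i) => ({ text: texts[i], score: dot(queryVector, vector) }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 3)
  .map(({ text }) => text);

// Ask the LLM to summarize / answer from only the retrieved chunks.
const prompt =
  'Answer the question using only the passages below.\n\n' +
  topChunks.map((text, i) => `[${i + 1}] ${text}`).join('\n\n') +
  `\n\nQuestion: ${question}\nAnswer:`;
const summary = await wllama.createCompletion(prompt, { nPredict: 300 });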

ngxson commented 6 months ago

Ah nice. I tried nomic-embed-text before, but it didn't work very well. Maybe that's because I used the Albert Einstein wiki page as the example, which is a very hard one.

Maybe you can give it a try?

Some questions that I tried but no success: Does he play guitar?

flatsiedatsie commented 6 months ago

Some questions that I tried but no success: Does he play guitar?

Did you let the LLM re-formulate the prompt first? In my project I just added a step to do that: it looks at the conversation history first and rewrites the user's prompt to be explicit, so "he" becomes "Albert Einstein". It seems to work.
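
A rough sketch of what such a reformulation step can look like (the prompt wording, variables and wllama call below are illustrative, not the exact ones from my project):

// Rewrite the latest user message into a standalone question before retrieval,
// so pronouns like "he" are replaced by the names they refer to.
const rewritePrompt =
  `Conversation so far:\n${historyText}\n\n` +
  'Rewrite the last user message as a single standalone question, ' +
  'replacing pronouns with the names they refer to.\n' +
  `Last user message: ${userMessage}\nStandalone question:`;

const standaloneQuestion = await wllama.createCompletion(rewritePrompt, { nPredict: 64 });
// standaloneQuestion (e.g. "Does Albert Einstein play guitar?") is then used
// for the embedding lookup instead of the raw message.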

In fact it's all working now. Although the answer in this case seems almost too good to be based solely on the retrieved chunks...

[Screenshot: 2024-05-27 at 08:40:37]