mlc-ai / web-llm

High-performance In-browser LLM Inference Engine
https://webllm.mlc.ai
Apache License 2.0

Model request: add support for a small RAG model #445

Open flatsiedatsie opened 1 month ago

flatsiedatsie commented 1 month ago

I've been implementing RAG in my current project, which is 100% browser-based.

While doing so I learnt that there are specialized RAG models, fine-tuned to generate only from the data they are given by the RAG search. The idea is to stop the model from making up information that isn't actually in the documents. This is vital in many fields, such as medicine or law, where a user may want a summary of a number of documents and needs some certainty that the summary contains only information from the source texts.

I was wondering if WebLLM could offer support for one of these specialized RAG models.

Since this model doesn't need to contain any knowledge about the world, and only has to be good at summarizing what it is given, it can be relatively small. That means it should run speedily on a lot of hardware, making this useful to a wide range of users, e.g. medical students with 8 GB of RAM in their MacBook Airs.

I've been trying to find a RAG model to recommend for this, but that has been surprisingly tricky. For example, this Reddit discussion from 3 months ago (one of many) identifies the same requirements for such a model, but it's not clear which model fits the bill.

I'll keep looking, and post a suggestion if I find one. But perhaps this is an area where the WebLLM team already has experience.

// Earlier discussion (and screenshots) can be found here: https://github.com/ngxson/wllama/issues/36
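
For reference, here is a rough sketch of how I imagine the grounded generation step could look in WebLLM once a suitable model is available. The model ID is just one of the existing prebuilt Phi 3 builds, and the system prompt, question and retrieved chunks are placeholders; the calls follow WebLLM's OpenAI-style API as I understand it.

    import { CreateMLCEngine } from "@mlc-ai/web-llm";

    // Placeholder model ID: whichever small RAG-tuned model ends up being supported.
    const MODEL_ID = "Phi-3-mini-4k-instruct-q4f16_1-MLC";

    const engine = await CreateMLCEngine(MODEL_ID, {
        initProgressCallback: (report) => console.log(report.text),
    });

    // These chunks would come out of the vector search over the user's documents.
    const retrievedChunks = [
        "First relevant chunk from the source documents...",
        "Second relevant chunk from the source documents...",
    ];

    const reply = await engine.chat.completions.create({
        messages: [
            {
                role: "system",
                content: "Answer using ONLY the provided context. " +
                         "If the answer is not in the context, say that you don't know.",
            },
            {
                role: "user",
                content: "Context:\n" + retrievedChunks.join("\n\n") +
                         "\n\nQuestion: What do the documents say about the treatment?",
            },
        ],
        temperature: 0,
    });

    console.log(reply.choices[0].message.content);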

flatsiedatsie commented 1 month ago

There is also an older Phi 2 version.

There are lots of Mistral 7B models, but such a large model would seem to defeat the point a bit. Also, the Phi 3 version scores just as well as the Mistral 7B version on Llmware's benchmark. A benchmark they developed themselves, but still.

Another Phi 3 option seems to be https://huggingface.co/TroyDoesAI/Phi-3-Context-Obedient-RAG. It has a different license: CC BY 4.0.

I suspect the "Bling" version of Phi 3 would be fine? And since it's Phi 3 there is already support for it in WebLLM.

I'm testing that model now and will let you know about any insights I gain into its quality.
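
Since the fine-tune keeps the Phi 3 architecture, I assume it could be registered in WebLLM as a custom model that reuses the prebuilt Phi 3 model library. A minimal sketch of that, assuming the weights have first been converted to MLC format; the repo URL and model_id below are hypothetical, and the exact ModelRecord field names may differ between WebLLM versions:

    import { CreateMLCEngine, prebuiltAppConfig } from "@mlc-ai/web-llm";

    // Hypothetical HF repo containing MLC-converted weights of the Phi 3 RAG fine-tune.
    const ragModel = {
        model: "https://huggingface.co/<user>/<phi3-rag-finetune>-q4f16_1-MLC",
        model_id: "Phi-3-RAG-q4f16_1-MLC",
        // Reuse the WASM model library of the stock Phi 3 build (same architecture).
        model_lib: prebuiltAppConfig.model_list.find(
            (m) => m.model_id === "Phi-3-mini-4k-instruct-q4f16_1-MLC"
        )?.model_lib,
    };

    const appConfig = {
        model_list: [...prebuiltAppConfig.model_list, ragModel],
    };

    const engine = await CreateMLCEngine("Phi-3-RAG-q4f16_1-MLC", { appConfig });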

Jaifroid commented 3 weeks ago

@flatsiedatsie This is something I'd also be interested in. Would you be aiming to run inference using pre-calculated vector embeddings, and if so, what would you use to calculate them?

flatsiedatsie commented 3 weeks ago

@Jaifroid Yes. I'm currently using Transformers.js to create the embeddings, since it has GPU acceleration support. You can find examples on the Transformers.js GitHub. Here's my code:

    // Runs inside the RAG worker. Requires: import { pipeline } from '@xenova/transformers';
    // `texts` is the array of document chunks, `web_gpu_supported` is a feature-detection flag.
    if (self.extractor == null) {
        const options = {
            quantized: false,  // use the full-precision ONNX weights
            progress_callback: (data) => {
                // Forward model download/load progress to the main thread.
                self.postMessage(data);
            },
        };
        if (web_gpu_supported) {
            options.device = 'webgpu';  // otherwise Transformers.js falls back to WASM/CPU
        }
        // Lazily create the embedding pipeline once and cache it on the worker.
        self.extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', options);
    }

    console.log("RAG WORKER: generating embeddings");
    const embeddings = await self.extractor(texts, { pooling: 'mean', normalize: true });
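
For completeness, here is roughly the retrieval step I run afterwards. Because the embeddings are normalized, a plain dot product equals cosine similarity. `texts` and `embeddings` come from the code above; `query` and the use of `.tolist()` to turn the Transformers.js tensor into plain arrays are my assumptions about the surrounding worker code.

    // Embed the user's query the same way as the document chunks.
    const queryOutput = await self.extractor(query, { pooling: 'mean', normalize: true });
    const queryVector = queryOutput.tolist()[0];

    // Convert the stored chunk embeddings to plain nested arrays: [numChunks][384].
    const chunkVectors = embeddings.tolist();

    // Dot product equals cosine similarity here because both vectors are normalized.
    const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

    // Rank the chunks and keep the three best matches for the prompt context.
    const topChunks = chunkVectors
        .map((vec, i) => ({ text: texts[i], score: dot(vec, queryVector) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, 3);

    console.log("RAG WORKER: top chunks:", topChunks);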