turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Integration with llamaindex for RAG #261

Open mirix opened 6 months ago

mirix commented 6 months ago

Hello,

I am interested in integrating exl2-quantised models with llamaindex for RAG.

Are you aware of any Python examples I could follow?

I am currently using GGUF models.

I have come across this:

https://github.com/chu-tianxiang/exl2-for-all

If it works, it should be straightforward to adapt my current scripts.

But, if someone has already done something similar, I'd rather not reinvent the wheel.

I see there is an exllamav2_hf module. Is it documented anywhere?

Best,

Ed

turboderp commented 6 months ago

I'm not aware of any exllamav2_hf module except for the one in text-generation-webui. It isn't quite an HF-compatible wrapper for ExLlamaV2, but it does plug in HF samplers, so maybe it could be a starting point for what you're doing.

I haven't heard of exl2-for-all before, either. I'll give it a look in a moment.

Mostly, though, ExLlamaV2's forward pass is fairly similar to that of an HF model. You pass it input token IDs and it outputs logits. For RAG I imagine you want output embeddings. You could get those with the return_last_state argument to the forward function, but I can't really say more without knowing what sort of output your current script needs for embeddings.
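
Something along these lines (a rough sketch only; the loading boilerplate follows the standard ExLlamaV2 examples, and the model path, input text and pooling choice are placeholders, not something your script specifically needs):

```python
# Rough sketch: pull logits and the last hidden state out of the forward pass.
# Loading boilerplate follows the standard ExLlamaV2 examples; the model path,
# input text and pooling choice are placeholders.
import torch
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/model-exl2"   # placeholder
config.prepare()

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)

input_ids = tokenizer.encode("Some passage to embed")

with torch.inference_mode():
    # return_last_state=True also returns the hidden states after the final layer
    logits, last_state = model.forward(input_ids, return_last_state=True)

# One simple way to get a fixed-size embedding: take the last token's hidden state
embedding = last_state[:, -1, :]
```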

mirix commented 6 months ago

Thanks. exl2-for-all seems to work fine for inference and enables reusing HF-like snippets with very few lines of code.

Unfortunately, it is not compatible with LlamaIndex's HuggingFaceLLM (or any other API).

I wonder how hard it would be to extend their Exl2ForCausalLM to be more in line with HF's AutoModelForCausalLM.

It would definitely increase the adoption of exl2.

Unfortunately, I am not a developer.

mirix commented 6 months ago

It works with exl2-for-all.

I just needed to replace the last line of model.py with

return model

There was an additional wrapper there breaking compatibility.

It uses just one GPU and it is much slower than AWQ, so I need to figure a couple of things out, but it is working!

The testing script is here:

https://github.com/mirix/retrieval-augmented-generation/blob/main/rag_llama_index_bm25_exl2_app.py
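
Roughly speaking, the wiring looks like the sketch below (a simplified illustration, not a verbatim excerpt; the exl2-for-all import path and the HuggingFaceLLM arguments shown may differ from the actual script):

```python
# Simplified sketch of hooking an exl2-for-all model into LlamaIndex.
# The import path for load_quantized_model is illustrative; model.py is patched
# so that its last line is `return model` (see above).
from transformers import AutoTokenizer
from llama_index.llms import HuggingFaceLLM    # llama_index import path at the time
from model import load_quantized_model         # exl2-for-all's loader (path may differ)

model_dir = "/path/to/model-exl2"              # placeholder
model = load_quantized_model(model_dir)        # returns the bare model after the patch
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Hand the pre-loaded model and tokenizer straight to LlamaIndex's HuggingFaceLLM wrapper
llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
    context_window=4096,
    max_new_tokens=512,
    generate_kwargs={"temperature": 0.7, "do_sample": True},
)
```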

edwardsmith999 commented 2 months ago

Hi @mirix, thanks for this. I'm interested in using exl2 for RAG, but the code example is missing the hacked model.py (which doesn't use Exl2ForCausalLM explicitly). It seems load_quantized_model returns this, but I get a string of errors trying to use it with the exl2 file I have.