Open mirix opened 6 months ago
I'm not aware of any `exllamav2_hf` module except for the one in text-generation-webui. It isn't quite an HF-compatible wrapper for ExLlamaV2, but it does plug in HF samplers, so maybe it could be a starting point for what you're doing.
I haven't heard of exl2-for-all before, either. I'll give it a look in a moment.
Mostly, though, ExLlamaV2's forward pass is fairly similar to that of an HF model. You pass it input token IDs and it outputs logits. For RAG I imagine you want output embeddings. You could get those with the `return_last_state` argument to the forward function, but I can't really say more without knowing what sort of output your current script needs for embeddings.
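As a rough illustration of the `return_last_state` idea above, here is a minimal sketch that mean-pools the final hidden state into a single vector. The helper names `mean_pool` and `embed_text` are hypothetical, and the code assumes that `model`, `tokenizer` and `cache` are already-loaded ExLlamaV2 objects, that `forward()` returns a `(logits, last_state)` pair when `return_last_state=True`, and that `last_state` has shape `(1, seq_len, hidden_dim)`. Whether mean pooling gives useful embeddings for a given model is a separate question.

```python
from statistics import fmean


def mean_pool(hidden_states):
    """Average per-token vectors (a list of equal-length float lists) into one vector."""
    if not hidden_states:
        raise ValueError("need at least one token vector")
    # zip(*rows) walks the vectors column by column, i.e. one hidden dimension at a time.
    return [fmean(column) for column in zip(*hidden_states)]


def embed_text(model, tokenizer, cache, text):
    """Hypothetical helper: embed `text` from the model's final hidden state.

    Assumptions (not confirmed by this thread): `model`, `tokenizer` and `cache`
    are loaded ExLlamaV2 / ExLlamaV2Tokenizer / ExLlamaV2Cache objects, and
    forward(..., return_last_state=True) returns (logits, last_state).
    """
    input_ids = tokenizer.encode(text)   # assumed tensor of shape (1, seq_len)
    cache.current_seq_len = 0            # start from an empty cache
    _logits, last_state = model.forward(input_ids, cache, return_last_state=True)
    # Assumed shape (1, seq_len, hidden_dim): take the single batch entry.
    return mean_pool(last_state[0].float().tolist())
```

The pooling step is kept in plain Python so it can be checked without a GPU; in practice you would probably pool directly on the tensor.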
Thanks. exl2-for-all seems to work fine for inference and enables reusing HF-like snippets with very few lines of code.
Unfortunately, it is not compatible with LlamaIndex's HuggingFaceLLM (or any other API).
I wonder how hard it would be to extend their Exl2ForCausalLM to be more in line with HF's AutoModelForCausalLM.
It would definitely increase the adoption of exl2.
Unfortunately, I am not a developer.
It works with exl2-for-all.
I just needed to replace the last line of model.py with
return model
There was an additional wrapper there breaking compatibility.
It uses just one GPU and it is much slower than AWQ, so I need to figure a couple of things out, but it is working!
The testing script is here:
https://github.com/mirix/retrieval-augmented-generation/blob/main/rag_llama_index_bm25_exl2_app.py
Hi @mirix, thanks for this. I'm interested in using exl2 for RAG, but the code example is missing the hacked `model.py` (which doesn't have `Exl2ForCausalLM` explicitly). It seems `load_quantized_model` returns this, but I get a string of errors trying to use it with the exl2 file I have.
Hello,
I am interested in integrating exl2-quantised models with LlamaIndex for RAG.
Are you aware of any Python examples I could follow?
I am currently using GGUF models.
I have come across this:
https://github.com/chu-tianxiang/exl2-for-all
If it works, it should be straightforward to adapt my current scripts.
But, if someone has already done something similar, I'd rather not reinvent the wheel.
I see there is an exllamav2_hf module. Is it documented anywhere?
Best,
Ed