rjmehta1993 opened 3 months ago
Do you support an ExLlamaV2 backend for inference, so that EXL quants can be served?
The current alternative is vLLM, but it doesn't support EXL quants. Also, after running a perplexity test, EXL came out best.
Transformers supports the ExLlamaV2 backend, but its tokens/sec throughput is very poor.
I believe this could be deployed through a custom Python-based backend, in a similar way to the existing vLLM backend: https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py
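For illustration, here is a rough sketch of what such a custom Python backend might look like, following the general structure of the vLLM backend's model.py. It assumes the ExLlamaV2 base-generator API (`ExLlamaV2BaseGenerator.generate_simple`) and a config.pbtxt with a BYTES input named "prompt" and a BYTES output named "generated_text"; the model path, tensor names, and sampling settings are placeholders, not an official backend.

```python
# Hypothetical Triton Python-backend wrapper around ExLlamaV2 (sketch only).
import numpy as np
import triton_python_backend_utils as pb_utils

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler


class TritonPythonModel:
    def initialize(self, args):
        # Path to the EXL2-quantized model; adjust to your model repository layout.
        config = ExLlamaV2Config()
        config.model_dir = "/models/exl2-model"
        config.prepare()

        self.model = ExLlamaV2(config)
        self.cache = ExLlamaV2Cache(self.model, lazy=True)
        self.model.load_autosplit(self.cache)
        self.tokenizer = ExLlamaV2Tokenizer(config)
        self.generator = ExLlamaV2BaseGenerator(self.model, self.cache, self.tokenizer)

        self.settings = ExLlamaV2Sampler.Settings()
        self.settings.temperature = 0.8
        self.settings.top_p = 0.9

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the prompt tensor (BYTES) sent by the client.
            prompt_tensor = pb_utils.get_input_tensor_by_name(request, "prompt")
            prompt = prompt_tensor.as_numpy()[0].decode("utf-8")

            # Synchronous, non-batched generation; a production backend would
            # want continuous batching and decoupled (streaming) responses.
            output = self.generator.generate_simple(prompt, self.settings, 256)

            out_tensor = pb_utils.Tensor(
                "generated_text", np.array([output.encode("utf-8")], dtype=object)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        self.model = None
```

This only covers the simple request/response path; matching the vLLM backend's throughput would additionally require batching across requests and streaming via Triton's decoupled response mode.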