triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Exllamav2 inference with EXL Quants #7477

Open rjmehta1993 opened 3 months ago

rjmehta1993 commented 3 months ago

Do you support an ExLlamaV2 backend for inference, so that EXL quants can be served?

The current alternative is vLLM, but that doesn't support EXL quants. Also, after running perplexity tests, EXL quants come out best.

Transformers supports an exllamav2 backend, but its tokens/sec throughput is very poor.

oandreeva-nv commented 3 months ago

I believe this can be deployed through a custom Python-based backend, similar to the way we provide the vLLM backend: https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py
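
For illustration, a minimal `model.py` following that structure might look something like the sketch below. The exllamav2 calls, the tensor names `text_input`/`text_output`, and the model path are assumptions for this example, not an official backend contract; the real vLLM backend is decoupled/streaming, while this sketch generates synchronously.

```python
# model.py -- minimal sketch of a Python-based Triton backend wrapping exllamav2.
# Assumes exllamav2's base generator API; not an official Triton backend.
import numpy as np
import triton_python_backend_utils as pb_utils

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler


class TritonPythonModel:
    def initialize(self, args):
        # Load the EXL2-quantized model once per model instance.
        config = ExLlamaV2Config()
        config.model_dir = "/models/exl2_model"  # hypothetical path
        config.prepare()

        self.model = ExLlamaV2(config)
        self.cache = ExLlamaV2Cache(self.model, lazy=True)
        self.model.load_autosplit(self.cache)
        self.tokenizer = ExLlamaV2Tokenizer(config)

        self.generator = ExLlamaV2BaseGenerator(self.model, self.cache, self.tokenizer)
        self.settings = ExLlamaV2Sampler.Settings()
        self.settings.temperature = 0.8

    def execute(self, requests):
        # Synchronous, one-request-at-a-time generation for simplicity.
        responses = []
        for request in requests:
            prompt = (
                pb_utils.get_input_tensor_by_name(request, "text_input")
                .as_numpy()[0]
                .decode("utf-8")
            )
            # Generate up to 256 new tokens for the given prompt.
            output_text = self.generator.generate_simple(prompt, self.settings, 256)
            out_tensor = pb_utils.Tensor(
                "text_output",
                np.array([output_text.encode("utf-8")], dtype=np.object_),
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        self.model = None
```

A production version would read the model directory and sampling parameters from the model config, batch requests, and stream tokens back via decoupled responses, as the vLLM backend does.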