triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Exllamav2 inference with EXL Quants #7477

Open rjmehta1993 opened 3 months ago

rjmehta1993 commented 3 months ago

Do you support an ExLlamaV2 backend for inference, so that EXL quants can be served?

The current alternative is vLLM, but that doesn't support EXL quants. Also, after running perplexity tests, EXL quants come out best.

Transformers supports an exllamav2 backend, but its tokens/sec throughput is very poor.

oandreeva-nv commented 3 months ago

I believe this can be deployed through a custom Python-based backend, similar to the way we provide the vLLM backend: https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py
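
For illustration, a minimal `model.py` following that structure might look something like the sketch below. The exllamav2 calls, the tensor names `text_input`/`text_output`, and the model path are assumptions for this example, not an official backend contract; the real vLLM backend is decoupled/streaming, while this sketch generates synchronously.

```python
# model.py -- minimal sketch of a Python-based Triton backend wrapping exllamav2.
# Assumes exllamav2's base generator API; not an official Triton backend.
import numpy as np
import triton_python_backend_utils as pb_utils

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler


class TritonPythonModel:
    def initialize(self, args):
        # Load the EXL2-quantized model once per model instance.
        config = ExLlamaV2Config()
        config.model_dir = "/models/exl2_model"  # hypothetical path
        config.prepare()

        self.model = ExLlamaV2(config)
        self.cache = ExLlamaV2Cache(self.model, lazy=True)
        self.model.load_autosplit(self.cache)
        self.tokenizer = ExLlamaV2Tokenizer(config)

        self.generator = ExLlamaV2BaseGenerator(self.model, self.cache, self.tokenizer)
        self.settings = ExLlamaV2Sampler.Settings()
        self.settings.temperature = 0.8

    def execute(self, requests):
        # Synchronous, one-request-at-a-time generation for simplicity.
        responses = []
        for request in requests:
            prompt = (
                pb_utils.get_input_tensor_by_name(request, "text_input")
                .as_numpy()[0]
                .decode("utf-8")
            )
            # Generate up to 256 new tokens for the given prompt.
            output_text = self.generator.generate_simple(prompt, self.settings, 256)
            out_tensor = pb_utils.Tensor(
                "text_output",
                np.array([output_text.encode("utf-8")], dtype=np.object_),
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        self.model = None
```

A production version would read the model directory and sampling parameters from the model config, batch requests, and stream tokens back via decoupled responses, as the vLLM backend does.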