Report on aphrodite-engine

vLLM has a cousin named Aphrodite (aphrodite-engine), which adds EXL2, GGUF, and a bunch of other handy features on top of it.

The good news is that the wheels already work out of the box on both P100 and P40, so no patching is needed there. EXL2 with batching on P100 is a creative writing beast, and GGUF on P40 with batching is 2-3x faster than without.

The bad news is that if you try to run it naively, the "Triton doesn't care about Pascal" bugs will bite you. After a few calls, the engine would simply hang on one of my P100s, consuming 100% GPU and never completing the API call. It's difficult to even terminate the process.
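A hang like this is easier to spot if the client enforces its own timeout instead of waiting forever. The following is a minimal sketch, not the setup used in this report: it assumes the engine is serving an OpenAI-compatible `/v1/completions` endpoint, and the URL, port, model name, and 120-second timeout are all placeholder assumptions to adjust.

```python
import json
import socket
import urllib.error
import urllib.request

# Assumed values for illustration only; adjust to your own server.
API_URL = "http://localhost:2242/v1/completions"
MODEL = "my-model"


def build_payload(prompt, model=MODEL, max_tokens=32):
    """Return the JSON body for an OpenAI-style completion request."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def probe(prompt, url=API_URL, timeout=120.0):
    """Send one request and return 'ok', 'timeout', or 'error'."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            resp.read()
        return "ok"
    except (socket.timeout, TimeoutError):
        # No response within the timeout: the symptom of a hung engine.
        return "timeout"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error"
    except OSError:
        # Connection reset or similar transport failure.
        return "error"


if __name__ == "__main__":
    for i in range(10):
        print(i, probe("Once upon a time"))
```

A run of `ok` results followed by `timeout` would match the behavior described above. A timeout only shows that the request stalled, so GPU utilization still needs to be checked separately.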

Installing the patched Triton from this repo appears to both improve performance by almost 1.5x and fix the hangs; at least, I have not had any problems since the switch. I'm wondering if this is worth documenting in case others have similar aspirations of batching with SOTA quants on Pascal GPUs.
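For anyone comparing notes, it may help to record which Triton package is installed and whether the visible GPUs are actually Pascal. This is a rough sketch under stated assumptions: the helper names are made up for illustration, Pascal is identified by compute capability 6.x (P100 is 6.0, P40 is 6.1), and the reported version string will only distinguish a patched build if that build carries a distinct version.

```python
from importlib import metadata


def is_pascal(capability):
    """True for compute capability 6.x (P100 is 6.0, P40 is 6.1)."""
    major, _minor = capability
    return major == 6


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_environment():
    """Return report lines for the Triton package and each visible GPU."""
    lines = [f"triton: {installed_version('triton')}"]
    try:
        import torch  # optional; only needed to query the GPUs
    except ImportError:
        return lines + ["torch not installed, skipping GPU check"]
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        lines.append(
            f"GPU {i}: {name}, sm_{cap[0]}{cap[1]}, pascal={is_pascal(cap)}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(describe_environment()))
```

Pasting this output into a report makes it easier to tell whether a hang happened with the stock or the replaced Triton package.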

sasha0552 / vllm-ci
