sasha0552 / vllm-ci

CI scripts designed to build a Pascal-compatible version of vLLM.
MIT License

Report on aphrodite-engine #5

Open the-crypt-keeper opened 1 month ago

the-crypt-keeper commented 1 month ago

vLLM has a cousin named aphrodite which adds EXL2, GGUF and a bunch of other handy features to it.

The good news is that the wheels already work out of the box on both P100 and P40, so no patching is needed there. EXL2 with batching on P100 is a creative writing beast, and GGUF on P40 is 2-3x faster with batching than without.

The bad news is that if you run it naively, the "Triton doesn't care about Pascal" bugs will bite you. After a few calls, the engine would simply hang on one of my P100s, consuming 100% GPU and never completing the API call. It was difficult to even terminate the process.

Installing the patched Triton from this repo appears to both fix the hangs and improve performance by almost 1.5x; at least, I have not had any problems since the switch. I'm wondering if this is worth documenting in case others have similar aspirations of batching with SOTA quants on Pascal GPUs.
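
For anyone unsure whether a given box is affected, here is a minimal sketch of a check for Pascal cards. It relies on the fact that Pascal GPUs report CUDA compute capability 6.x (P100 is 6.0, P40 is 6.1). The helper names `is_pascal`, `pascal_devices`, and `detect_capabilities` are hypothetical and not part of vLLM, aphrodite-engine, or this repo; the optional detection step assumes PyTorch is installed.

```python
# Pascal-generation GPUs report CUDA compute capability 6.x
# (P100 is 6.0, P40 is 6.1).
PASCAL_MAJOR = 6


def is_pascal(capability):
    """Return True if a (major, minor) compute capability belongs to Pascal."""
    major, _minor = capability
    return major == PASCAL_MAJOR


def pascal_devices(capabilities):
    """Given {label: (major, minor)}, return the labels of the Pascal devices."""
    return [label for label, cap in capabilities.items() if is_pascal(cap)]


def detect_capabilities():
    """Query local GPUs through PyTorch (assumed installed; returns {} if not)."""
    try:
        import torch
    except ImportError:
        return {}
    return {
        # Prefix with the device index so two identical cards get distinct keys.
        f"{i}: {torch.cuda.get_device_name(i)}": torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    }
```

Calling `pascal_devices(detect_capabilities())` lists the cards for which the patched Triton from this repo is likely relevant; an empty list means either no Pascal GPU was found or PyTorch is not available.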

sasha0552 commented 1 month ago

Thanks, added to README.md. Maybe we should ask the aphrodite-engine developers to mention this repository in their documentation?