pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
BSD 3-Clause "New" or "Revised" License

batching/dynamic batching #112

Open nivibilla opened 7 months ago

nivibilla commented 7 months ago

Thanks for the amazing work! It really is super fast at bs=1.

Can batched use cases, or dynamic batching, be supported?

Chillee commented 7 months ago

It is not too difficult to modify it to support batched use cases, but supporting dynamic batching is quite a bit more work.
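For a rough idea of what that modification involves, here is a minimal sketch, not gpt-fast's actual code: the preallocated KV cache gains a batch dimension, and the decode loop produces one token per sequence at each step. The `KVCache` class and `greedy_decode` function below are hypothetical illustrations, shown as two independent fragments rather than a complete pipeline.

```python
import torch
import torch.nn as nn

class KVCache(nn.Module):
    """Preallocated KV cache with an explicit batch dimension (hypothetical sketch)."""

    def __init__(self, batch_size, n_heads, max_seq_len, head_dim, dtype=torch.bfloat16):
        super().__init__()
        shape = (batch_size, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_cache", torch.zeros(shape, dtype=dtype))
        self.register_buffer("v_cache", torch.zeros(shape, dtype=dtype))

    def update(self, input_pos, k_new, v_new):
        # input_pos: (num_new_tokens,) positions being written this step
        # k_new, v_new: (batch_size, n_heads, num_new_tokens, head_dim)
        self.k_cache[:, :, input_pos] = k_new
        self.v_cache[:, :, input_pos] = v_new
        return self.k_cache, self.v_cache


@torch.no_grad()
def greedy_decode(model, prompt_ids, max_new_tokens):
    # prompt_ids: (batch_size, prompt_len); all prompts padded to one length,
    # which is the main simplification a static-batch version makes.
    tokens = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                   # (batch_size, seq_len, vocab_size)
        next_ids = logits[:, -1].argmax(dim=-1)  # one greedy token per sequence
        tokens = torch.cat([tokens, next_ids[:, None]], dim=1)
    return tokens
```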

If you really want continuous batching, I would suggest looking at projects like vLLM or TensorRT-LLM for now.
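As a quick illustration of what that looks like from the user's side, here is a minimal vLLM sketch (assuming vLLM is installed; the model name and prompts are just examples). vLLM handles the batching internally, so you simply pass a list of prompts.

```python
from vllm import LLM, SamplingParams

# Load the model; vLLM schedules and batches requests internally.
llm = LLM(model="meta-llama/Meta-Llama-3-8B")
params = SamplingParams(temperature=0.8, max_tokens=64)

# Multiple prompts are processed together without any manual batching logic.
outputs = llm.generate(["Hello, my name is", "The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```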

Ying1123 commented 2 months ago

For anyone interested in this issue, we have successfully integrated torch.compile into a dynamic batching serving system: https://github.com/sgl-project/sglang.

We use FlashInfer for the attention kernels and torch.compile for everything else. We found this combination to be faster than TensorRT-LLM and the original gpt-fast, and much faster than vLLM. It also supports features such as continuous batching and prefix caching.

You can give it a try:

```bash
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B --enable-torch-compile
```
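Once the server is up, a minimal client call might look like the following sketch; it assumes the server is listening on SGLang's default port (30000) and uses its native /generate endpoint, so adjust the URL and sampling parameters to your setup.

```python
import requests

# Hypothetical client request against the server launched above.
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 32, "temperature": 0.0},
    },
)
print(resp.json())
```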