nivibilla opened this issue 7 months ago
It is not that difficult to modify it to support batch use cases, but supporting dynamic batching is quite a bit more work.
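For reference, here is a rough sketch of what the bs>1 modification could look like: batched greedy decoding against a gpt-fast-style interface with a static KV cache. The `model.setup_caches(...)` and `model(idx, input_pos)` names are assumptions for illustration, not a drop-in patch for the repo.

```python
# Hypothetical sketch of batched greedy decoding with a static KV cache.
# The model interface mirrors a gpt-fast-style API, but the exact names
# (setup_caches, model(idx, input_pos)) are assumptions.
import torch

@torch.no_grad()
def generate_batched(model, prompt_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # prompt_ids: (batch, prompt_len), all prompts padded to the same length
    batch, prompt_len = prompt_ids.shape
    device = prompt_ids.device

    # Allocate KV caches once with a batch dimension > 1; this is the main
    # change versus the bs=1 path (every cache tensor gets a leading batch dim).
    model.setup_caches(max_batch_size=batch, max_seq_length=prompt_len + max_new_tokens)

    # Prefill: run the whole prompt through in one forward pass.
    input_pos = torch.arange(prompt_len, device=device)
    logits = model(prompt_ids, input_pos)                    # (batch, prompt_len, vocab)
    next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # (batch, 1)

    tokens = [next_token]
    # Decode: one token per step for every sequence in the batch.
    for step in range(1, max_new_tokens):
        input_pos = torch.tensor([prompt_len + step - 1], device=device)
        logits = model(next_token, input_pos)                # (batch, 1, vocab)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens.append(next_token)

    return torch.cat(tokens, dim=1)                          # (batch, max_new_tokens)
```

Note that this is static batching: all sequences start together and run for the same number of steps, which is exactly the simplification that dynamic/continuous batching removes.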
If you really want continuous batching, I would suggest looking at projects like vLLM or TensorRT-LLM for now.
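For example, vLLM's offline API already gives you dynamically batched generation in a few lines (exact argument names may vary across versions):

```python
# Offline batched generation with vLLM; continuous batching is handled
# internally by the engine's scheduler.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B")
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

prompts = [
    "Explain KV caching in one sentence.",
    "Write a haiku about GPUs.",
    "What is continuous batching?",
]

# All prompts are submitted together and scheduled dynamically under the hood.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```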
For anyone interested in this issue, we have successfully integrated torch.compile into a dynamic batching serving system: https://github.com/sgl-project/sglang.
We use flashinfer for the attention kernels and torch.compile for all other parts. We found this combination makes it faster than TensorRT-LLM and the original gpt-fast, and much faster than vLLM. It also supports all the other serving features, such as continuous batching and prefix caching.
You can give it a try:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B --enable-torch-compile
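Once the server is up, you can send requests over HTTP. The default port and the /generate payload shape below follow sglang's native API as I understand it, but treat them as assumptions and check the docs for your version; the server batches concurrent requests continuously.

```python
# Minimal client sketch against a locally launched sglang server.
# Port 30000 and the /generate payload shape are assumptions.
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    },
)
print(resp.json())
```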
Thanks for the amazing work! It really is super fast at bs=1.
Can batch use cases or dynamic batching be supported?