vgoklani opened 7 months ago
Hi @vgoklani - let me check and get back to you this week. I believe we have continuous batching in TorchServe but let me verify.
Hi @lessw2020 - first, I want to say thank you for your YouTube videos on FSDP!!!
For continuous/dynamic batching, we really want something that's in Python :) where it's easy to tweak the server. Since the main bottleneck is GPU-bound generation (at least for LLMs), a Rust- or Java-based web server framework offers only a marginal benefit. Nevertheless, the main frameworks (e.g., TGI and vLLM) are not in Python. Thanks!
Hi @vgoklani - got it, thanks for your feedback. This has generated a discussion about possibly making a reference architecture to showcase these types of features. Let me leave this issue open, and I will update it if this turns into a real effort.
Do you know of a good example of continuous batching? We would like to combine that with the paged attention kernel to build our own simple serving solution.
Thanks!
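
Not a maintainer, but since the question came up: below is a minimal, framework-agnostic sketch of what a continuous-batching decode loop could look like in Python. Everything here is hypothetical (`Request`, `ContinuousBatcher`, `step_fn`, `EOS_ID` are illustrative names, not from any real library); `step_fn` stands in for one batched forward pass of the model, which is also where a paged-attention KV cache would live.

```python
import queue
from dataclasses import dataclass, field

EOS_ID = 2  # placeholder end-of-sequence token id; depends on the tokenizer


@dataclass
class Request:
    prompt_ids: list          # token ids of the prompt
    max_new_tokens: int
    output_ids: list = field(default_factory=list)
    done: bool = False


class ContinuousBatcher:
    """Toy continuous-batching scheduler: new requests join the running
    batch between decode steps instead of waiting for the batch to drain."""

    def __init__(self, step_fn, max_batch_size=32):
        self.step_fn = step_fn          # (active requests) -> next token per request
        self.max_batch_size = max_batch_size
        self.waiting = queue.Queue()    # requests not yet admitted
        self.active = []                # requests currently decoding

    def submit(self, request):
        self.waiting.put(request)

    def run_step(self):
        # Admit waiting requests into any free batch slots.
        while len(self.active) < self.max_batch_size and not self.waiting.empty():
            self.active.append(self.waiting.get())
        if not self.active:
            return
        # One decode step for the whole batch.
        next_tokens = self.step_fn(self.active)
        for req, tok in zip(self.active, next_tokens):
            req.output_ids.append(tok)
            if tok == EOS_ID or len(req.output_ids) >= req.max_new_tokens:
                req.done = True
        # Retire finished requests; their slots free up immediately,
        # which is what distinguishes continuous from static batching.
        self.active = [r for r in self.active if not r.done]


# Toy usage: a "model" that always emits token 7, stopping at max_new_tokens.
batcher = ContinuousBatcher(step_fn=lambda reqs: [7] * len(reqs))
batcher.submit(Request(prompt_ids=[1, 5, 9], max_new_tokens=3))
for _ in range(3):
    batcher.run_step()
```

In a real server, an HTTP handler would call `submit()` while a dedicated loop (typically pinned to the GPU) calls `run_step()`; the key property is that finished requests free their batch slots immediately, so new requests start decoding without waiting for the longest sequence in the batch.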