vgoklani opened 7 months ago
Hi @vgoklani - let me check and get back to you this week. I believe we have continuous batching in TorchServe but let me verify.
Hi @lessw2020 - first, I want to say thank you for your YouTube videos on FSDP!!!
For continuous/dynamic batching, we really want something that's in Python :) where it's easy to tweak the server. Since the main bottleneck is GPU-bound generation (at least for LLMs), a Rust- or Java-based web server framework offers only a marginal benefit. Nevertheless, the main frameworks (e.g., TGI and vLLM) are not in Python. Thanks!
Hi @vgoklani - got it, thanks for your feedback. This has generated a discussion about possibly making a reference architecture to showcase these types of features. Let me leave this issue open, and I will update it if this turns into a real effort.
Do you know of a good example of continuous batching? We would like to combine that with the paged attention kernel to build our own simple serving solution.
Thanks!
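
Not a maintainer, but since the question came up: below is a minimal, framework-agnostic sketch of what a continuous-batching decode loop could look like in Python. Everything here is hypothetical (`Request`, `ContinuousBatcher`, `step_fn`, `EOS_ID` are illustrative names, not from any real library); `step_fn` stands in for one batched forward pass of the model, which is also where a paged-attention KV cache would live.

```python
import queue
from dataclasses import dataclass, field

EOS_ID = 2  # placeholder end-of-sequence token id; depends on the tokenizer


@dataclass
class Request:
    prompt_ids: list          # token ids of the prompt
    max_new_tokens: int
    output_ids: list = field(default_factory=list)
    done: bool = False


class ContinuousBatcher:
    """Toy continuous-batching scheduler: new requests join the running
    batch between decode steps instead of waiting for the batch to drain."""

    def __init__(self, step_fn, max_batch_size=32):
        self.step_fn = step_fn          # (active requests) -> next token per request
        self.max_batch_size = max_batch_size
        self.waiting = queue.Queue()    # requests not yet admitted
        self.active = []                # requests currently decoding

    def submit(self, request):
        self.waiting.put(request)

    def run_step(self):
        # Admit waiting requests into any free batch slots.
        while len(self.active) < self.max_batch_size and not self.waiting.empty():
            self.active.append(self.waiting.get())
        if not self.active:
            return
        # One decode step for the whole batch.
        next_tokens = self.step_fn(self.active)
        for req, tok in zip(self.active, next_tokens):
            req.output_ids.append(tok)
            if tok == EOS_ID or len(req.output_ids) >= req.max_new_tokens:
                req.done = True
        # Retire finished requests; their slots free up immediately,
        # which is what distinguishes continuous from static batching.
        self.active = [r for r in self.active if not r.done]


# Toy usage: a "model" that always emits token 7, stopping at max_new_tokens.
batcher = ContinuousBatcher(step_fn=lambda reqs: [7] * len(reqs))
batcher.submit(Request(prompt_ids=[1, 5, 9], max_new_tokens=3))
for _ in range(3):
    batcher.run_step()
```

In a real server, an HTTP handler would call `submit()` while a dedicated loop (typically pinned to the GPU) calls `run_step()`; the key property is that finished requests free their batch slots immediately, so new requests start decoding without waiting for the longest sequence in the batch.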