Open · saswat0 opened 7 months ago
@snexus Kudos on this awesome project!
I was wondering whether support for batched prompts is on your roadmap? There are solutions that make this possible for several language models, so are you planning to include these optimisations in your source?
TIA
Hi,
Thanks for the suggestion. How do you think batched prompts could be useful in the context of RAG?
One use case I can think of: if deployed in production, the server could queue incoming requests (prompts) and run the retrieval-and-generation step once over the whole batch. Effectively, per-request latency would be slightly higher, but GPU utilisation would increase severalfold.
I will add it as a potential improvement when implementing support for vLLM in the future.
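To make the queueing idea concrete, here is a minimal sketch of server-side prompt batching in Python. It assumes an LLM backend that can generate completions for a list of prompts in a single call (vLLM's `LLM.generate` accepts a list of prompts, for example); `batch_generate`, `MAX_BATCH`, and `WINDOW_S` are hypothetical names used purely for illustration, not part of any real API:

```python
# A minimal sketch of server-side prompt batching. The queue, the flush
# parameters, and batch_generate are illustrative assumptions.
import queue
import threading

MAX_BATCH = 8    # hypothetical: flush once this many prompts are queued
WINDOW_S = 0.05  # hypothetical: or after waiting this long for stragglers

requests: queue.Queue = queue.Queue()  # items are (prompt, reply_queue) pairs

def batch_generate(prompts: list[str]) -> list[str]:
    # Stand-in for a real batched call, e.g. with vLLM:
    #   outputs = llm.generate(prompts, sampling_params)
    return [f"answer to: {p}" for p in prompts]

def batcher() -> None:
    """Collect prompts into a batch, run them once, fan results back out."""
    while True:
        batch = [requests.get()]  # block until the first request arrives
        while len(batch) < MAX_BATCH:
            try:  # drain further requests arriving within the window
                batch.append(requests.get(timeout=WINDOW_S))
            except queue.Empty:
                break
        answers = batch_generate([prompt for prompt, _ in batch])
        for (_, reply), answer in zip(batch, answers):
            reply.put(answer)  # wake the caller waiting in ask()

def ask(prompt: str) -> str:
    """Called once per incoming request; blocks until the batch is answered."""
    reply: queue.Queue = queue.Queue(maxsize=1)
    requests.put((prompt, reply))
    return reply.get()

threading.Thread(target=batcher, daemon=True).start()

if __name__ == "__main__":
    from concurrent.futures import ThreadPoolExecutor
    # Eight concurrent "users": the batcher serves them in one or two backend calls.
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(list(pool.map(ask, [f"question {i}" for i in range(8)])))
```

With dynamic batching along these lines, each caller pays at most `WINDOW_S` of extra latency, while the backend sees one large batch instead of many single-prompt calls, which is where the GPU-utilisation gain would come from.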