Closed: eshoyuan closed this issue 1 year ago
Hi @eshoyuan, thanks a lot for your feedback! For LLM inference, we currently use vLLM's inference engine. In terms of speed, it's among the fastest open-source solutions out there, as you can see in their blog post.
If I understand correctly, you're suggesting that we batch inputs to get faster responses. vLLM actually does this out of the box. Have you experienced slow responses when using Haven?
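To make that concrete, here's a minimal sketch of how vLLM takes a list of prompts and batches them on its own; the model name and sampling parameters below are placeholders, not necessarily what Haven ships with:

```python
# Minimal sketch: vLLM schedules and batches these prompts internally,
# so the caller never has to build batches by hand.
from vllm import LLM, SamplingParams

prompts = [
    "What is the capital of France?",
    "Summarize the benefits of batched inference in one sentence.",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

# Placeholder model; Haven's default may differ.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# generate() accepts a whole list of prompts and runs them together.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```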
Thank you for your response. I've read through the README and documentation but didn't find any mention of vLLM, so I haven't tried it yet. The fact that you're using vLLM is fantastic news, and I will definitely give it a try.
Firstly, I want to express my gratitude for your work on this repository. It's been incredibly useful and I appreciate the effort that has gone into it.
I'm reaching out with a question regarding support for LLM inference acceleration. I've noticed that LLM inference can be quite slow when the batch size is set to 1. It's well known that increasing the batch size often gives a significant boost in throughput, and using other projects like llama.cpp can also help with acceleration (a sketch of the kind of setup I have in mind follows below).
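For illustration, this is roughly what I mean by the llama.cpp route, using the llama-cpp-python bindings; the quantized model path and parameters are just placeholders:

```python
# Rough sketch of CPU-friendly inference via llama-cpp-python.
# The GGUF model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048)

out = llm(
    "Q: What is the capital of France? A:",
    max_tokens=64,
    stop=["Q:"],  # stop before the model starts a new question
)
print(out["choices"][0]["text"])
```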
I'm curious to know whether there are any measures in place in this repository to speed up LLM inference, or whether there are plans to add such support in the future. I believe this could be a valuable enhancement for many users of this repository, particularly those working with LLMs.