onecx-apps / onecx-chat

OneCX Chat Management

Evaluate vLLM #22

Closed. lmitlaender closed this issue 9 months ago.

lmitlaender commented 9 months ago

This issue's goal is to evaluate the vLLM project for LLM serving (in comparison to Ollama).

vllm source code: https://github.com/vllm-project/vllm

lmitlaender commented 9 months ago
  1. Difference: vLLM only works on Linux and installation fails with an error on Windows. vLLM also only supports GPU inference, which shouldn't be too big a limitation. (A minimal environment check is sketched below.)
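
A minimal sketch of checking these constraints before trying to install or import vLLM; this is an illustrative assumption about how one might guard a deployment script, not something from the issue itself:

```python
import platform

import torch

# vLLM is Linux-only and currently requires a GPU for inference,
# so verify both before attempting to use it.
if platform.system() != "Linux":
    raise RuntimeError("vLLM only supports Linux; installation on Windows fails.")
if not torch.cuda.is_available():
    raise RuntimeError("vLLM requires a CUDA-capable GPU for inference.")

print(f"OK: Linux with {torch.cuda.device_count()} visible GPU(s).")
```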
lmitlaender commented 9 months ago
  1. Provides faster serving through SOTA optimizations such as tensor parallelism and continuous batching
  2. Works with most standard Hugging Face model architectures and supports AWQ quantization (see the sketch after this list)
  3. Integrates with Ray Serve for full-scale distributed computing
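
As a rough sketch of how these options surface in vLLM's offline Python API; the model checkpoint, GPU count, and prompts below are placeholder assumptions for illustration:

```python
from vllm import LLM, SamplingParams

# Placeholder AWQ-quantized Hugging Face checkpoint; any supported architecture works.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # assumed example model
    quantization="awq",          # load the AWQ-quantized weights
    tensor_parallel_size=2,      # split the model across 2 GPUs (adjust to your setup)
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# Continuous batching happens inside the engine; just hand over a batch of prompts.
outputs = llm.generate(
    ["What is vLLM?", "Explain continuous batching in one sentence."],
    sampling,
)
for out in outputs:
    print(out.outputs[0].text)
```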
lmitlaender commented 9 months ago
  1. Isn't supposed to do any prompt magic, which gives a stronger separation of concerns (prompt construction stays in the application)
lmitlaender commented 9 months ago
  1. Supports serving an OpenAI-compatible API (see the sketch below)
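
A hedged sketch of talking to vLLM's OpenAI-compatible server from Python; the model name, port, and client settings are assumptions for illustration only:

```python
# Start the OpenAI-compatible server first, for example:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
# It listens on http://localhost:8000/v1 by default.

from openai import OpenAI

# vLLM does not enforce an API key unless configured to, so a placeholder is fine.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the served model
    messages=[{"role": "user", "content": "Say hello from vLLM."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```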

See parts of this video for some of the speedups that vLLM can provide: https://www.youtube.com/watch?v=TJ5K1CO9Wbs

lmitlaender commented 9 months ago

Might be interesting: https://github.com/ray-project/ray-llm integrates vLLM with Ray in a single solution that combines multiple optimizations.