onecx-apps / onecx-chat

OneCX Chat Management

Evaluate vLLM #22

Closed. lmitlaender closed this issue 9 months ago.

lmitlaender commented 9 months ago

This issue's goal is to evaluate the vLLM project for LLM serving (in comparison to Ollama).

vllm source code: https://github.com/vllm-project/vllm

lmitlaender commented 9 months ago
  1. Difference: vLLM only works on Linux and installation fails with an error on Windows. vLLM also only supports GPU inference, which shouldn't be too big a limitation. (A minimal environment check is sketched below.)
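
A minimal sketch of checking these constraints before trying to install or import vLLM; this is an illustrative assumption about how one might guard a deployment script, not something from the issue itself:

```python
import platform

import torch

# vLLM is Linux-only and currently requires a GPU for inference,
# so verify both before attempting to use it.
if platform.system() != "Linux":
    raise RuntimeError("vLLM only supports Linux; installation on Windows fails.")
if not torch.cuda.is_available():
    raise RuntimeError("vLLM requires a CUDA-capable GPU for inference.")

print(f"OK: Linux with {torch.cuda.device_count()} visible GPU(s).")
```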
lmitlaender commented 9 months ago
  1. Provides faster serving through SOTA optimizations such as tensor parallelism and continuous batching
  2. Works with most standard Hugging Face model architectures and supports AWQ quantization (see the sketch after this list)
  3. Integrates with Ray Serve for full-scale distributed computing
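
As a rough sketch of how these options surface in vLLM's offline Python API; the model checkpoint, GPU count, and prompts below are placeholder assumptions for illustration:

```python
from vllm import LLM, SamplingParams

# Placeholder AWQ-quantized Hugging Face checkpoint; any supported architecture works.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # assumed example model
    quantization="awq",          # load the AWQ-quantized weights
    tensor_parallel_size=2,      # split the model across 2 GPUs (adjust to your setup)
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# Continuous batching happens inside the engine; just hand over a batch of prompts.
outputs = llm.generate(
    ["What is vLLM?", "Explain continuous batching in one sentence."],
    sampling,
)
for out in outputs:
    print(out.outputs[0].text)
```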
lmitlaender commented 9 months ago
  1. Isn't supposed to do any prompt magic, which gives a stronger separation of concerns (prompt construction stays in the application)
lmitlaender commented 9 months ago
  1. Supports serving an OpenAI-compatible API (see the sketch below)
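
A hedged sketch of talking to vLLM's OpenAI-compatible server from Python; the model name, port, and client settings are assumptions for illustration only:

```python
# Start the OpenAI-compatible server first, for example:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
# It listens on http://localhost:8000/v1 by default.

from openai import OpenAI

# vLLM does not enforce an API key unless configured to, so a placeholder is fine.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the served model
    messages=[{"role": "user", "content": "Say hello from vLLM."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```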

See parts of this video for some of the speedups that vLLM can provide: https://www.youtube.com/watch?v=TJ5K1CO9Wbs

lmitlaender commented 9 months ago

Might be interesting: https://github.com/ray-project/ray-llm integrates vLLM with Ray in a single solution that combines multiple optimizations.