microsoft / MInference

To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparse patterns, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: How does VLLM use MInference through OpenAI Compatible Server? #40

Open jueming0312 opened 2 months ago

jueming0312 commented 2 months ago

Describe the issue

Can I run "python -m vllm.entrypoints.openai.api_server" and have vLLM load MInference's capabilities?

iofu728 commented 2 months ago

Hi @jueming0312, thanks for your interest in MInference.

MInference is a method to accelerate self-deployed LLM inference in long-context scenarios. It does not support acceleration for API-based LLMs.
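For context, using MInference on a self-deployed model means patching a locally loaded model object before inference. A minimal sketch, assuming a HuggingFace-style model and the `MInference` patch class from this package (the model name is an example; exact constructor arguments may differ across versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference  # patching entry point exposed by the package

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # example long-context model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Patch the attention modules so pre-filling uses dynamic sparse attention.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)
```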

kalocide commented 1 month ago

I'm seconding this: vLLM is a self-deployed LLM inference engine, but it does support model serving over an OpenAI-compatible API, which is what @jueming0312 is asking about. If this is a goal of the project, I would suggest publishing a package that bundles the vLLM server code with the MInference patch.
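For reference, the offline vLLM path works by patching an `LLM` instance before generation; a bundled server would presumably wrap the same step. A rough sketch, assuming the `"vllm"` patch type is available in the installed MInference version and that the model name below is just an example:

```python
from vllm import LLM, SamplingParams
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # example long-context model

# Build the vLLM engine as usual (eager mode is typically required for the patch).
llm = LLM(model_name, enforce_eager=True, max_model_len=128000)

# Apply the MInference patch to the engine before serving or generation.
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

outputs = llm.generate(
    ["Summarize the following document: ..."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

The OpenAI-compatible server (`vllm.entrypoints.openai.api_server`) does not expose a hook for applying such a patch to its engine out of the box, which is why bundling the server code with the patch seems like the practical route.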