Open jueming0312 opened 2 months ago
Describe the issue

Can I run "python -m vllm.entrypoints.openai.api_server" to load MInference capabilities in vLLM?

Hi @jueming0312, thanks for your interest in MInference.

MInference is a method for accelerating self-deployed LLM inference in long-context scenarios. It does not support acceleration for API-based LLMs.
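For reference, the vLLM integration that MInference does support is the offline `LLM` API, where the engine is patched in-process before generation. A minimal sketch along those lines (the model name and generation settings are placeholders only):

```python
# Sketch: accelerating a self-deployed vLLM engine with MInference
# (offline LLM API, patched in-process before generation).
# The model name and generation settings are placeholders only.
from vllm import LLM, SamplingParams
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context model

# Build the offline vLLM engine first...
llm = LLM(model_name, max_num_seqs=1, enforce_eager=True, max_model_len=128_000)

# ...then apply the MInference sparse-attention patch to it.
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

outputs = llm.generate(["<very long prompt>"], SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```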
I'm seconding this: vLLM is a self-deployed LLM inference engine, but it also supports model serving over an OpenAI-compatible API, which is what @jueming0312 is asking about. If supporting that path is a goal of the project, I would suggest publishing a package that bundles the vLLM server code with the MInference patch; a rough sketch of what that could look like is below.
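For illustration only, here is a toy version of that idea: wrap an MInference-patched offline vLLM `LLM` in a small FastAPI app that exposes an OpenAI-style /v1/completions route. This is not the real `vllm.entrypoints.openai.api_server` (which is built on the async engine), the patch call assumes the `MInference("vllm", ...)` interface from the project README (please correct me if that's wrong), and the model name and port are placeholders:

```python
# Toy sketch of "bundling the vLLM server code with the MInference patch":
# a minimal OpenAI-style /v1/completions endpoint around an MInference-patched
# offline vLLM engine. Not the real vllm.entrypoints.openai.api_server;
# model name and port are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from minference import MInference

MODEL_NAME = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder

llm = LLM(MODEL_NAME, enforce_eager=True, max_model_len=128_000)
llm = MInference("vllm", MODEL_NAME)(llm)  # apply the MInference patch (assumed API)

app = FastAPI()


class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 64
    temperature: float = 0.0


@app.post("/v1/completions")
def completions(req: CompletionRequest):
    params = SamplingParams(max_tokens=req.max_tokens, temperature=req.temperature)
    out = llm.generate([req.prompt], params)[0]
    # Return a minimal OpenAI-completions-shaped response body.
    return {
        "object": "text_completion",
        "model": MODEL_NAME,
        "choices": [{"index": 0, "text": out.outputs[0].text}],
    }


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)  # placeholder port
```

Note the offline `LLM` class handles requests synchronously, so this is only a sketch of the packaging idea, not a production-ready server.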