vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: how do I pass in the JSON content-type for ASYNC Mistral 7B offline inference #7908

Open fatihyildiz-cs opened 2 months ago

fatihyildiz-cs commented 2 months ago

Your current environment

PyTorch version: 2.2.2
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.5 (x86_64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct  2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)] (64-bit runtime)
Python platform: macOS-14.5-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] sentence-transformers==2.2.2
[pip3] torch==2.2.2
[pip3] torchvision==0.17.2
[pip3] transformers==4.42.3
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

How would you like to use vllm

Similar to #7030, I would like to use JSON mode for Mistral 7B during offline inference with the generate method, but asynchronously. We would like to stream the response to our app and thought we could use the AsyncLLMEngine to do that. The LLM class wraps the synchronous LLMEngine and injects the JSON schema thanks to a recent PR, but there is no equivalent wrapper for the async engine, so we cannot supply a schema through the plain AsyncLLMEngine class. Is there a plan to integrate a wrapper (an AsyncLLM class) for the async engine? We appreciate any suggestions; a sketch of what we have been attempting is below. Thank you!
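For reference, here is roughly the shape of what we have been trying: build the guided-decoding logits processor ourselves, the same way LLM.generate() does internally, and attach it to SamplingParams before calling AsyncLLMEngine.generate. The helper names (get_local_guided_decoding_logits_processor, GuidedDecodingRequest) come from the offline guided-decoding PR and are internal, version-dependent APIs, so treat this as a sketch to check against your installed vLLM, not a supported recipe; the model name, backend choice, prompt, and schema are placeholders.

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
# Internal modules; paths may differ in other vLLM versions.
from vllm.model_executor.guided_decoding import (
    get_local_guided_decoding_logits_processor)
from vllm.model_executor.guided_decoding.guided_fields import (
    GuidedDecodingRequest)

# Toy schema for illustration.
JSON_SCHEMA = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}


async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="mistralai/Mistral-7B-Instruct-v0.3"))

    # Build the same logits processor that LLM.generate() constructs
    # when guided_options_request is passed. "outlines" is one of the
    # guided-decoding backends; get_tokenizer() is awaitable on the
    # async engine in the versions we looked at.
    tokenizer = await engine.get_tokenizer()
    guided_lp = get_local_guided_decoding_logits_processor(
        "outlines",
        GuidedDecodingRequest(guided_json=JSON_SCHEMA),
        tokenizer)

    params = SamplingParams(
        max_tokens=256,
        logits_processors=[guided_lp] if guided_lp else None)

    # engine.generate() is an async generator of RequestOutput objects.
    async for output in engine.generate(
            "[INST] Name a city in France as JSON. [/INST]",
            params,
            request_id="guided-json-0"):
        print(output.outputs[0].text)


asyncio.run(main())
```

Note that each RequestOutput yielded by the generator carries the cumulative text produced so far, so a real streaming handler would forward only the delta since the previous iteration.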

RoopeHakulinen commented 2 months ago

I'd need this too. Would be great to understand if there's some way to make this happen with the current version 🙂

fatihyildiz-cs commented 1 month ago

Still haven't found a solution for this. I'd appreciate any tips.