Your current environment
PyTorch version: 2.2.2
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 14.5 (x86_64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A
Python version: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct 2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)] (64-bit runtime)
Python platform: macOS-14.5-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] sentence-transformers==2.2.2
[pip3] torch==2.2.2
[pip3] torchvision==0.17.2
[pip3] transformers==4.42.3
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
How would you like to use vllm
Similar to #7030, I would like to use JSON mode for Mistral 7B while doing offline inference with the generate method, but asynchronously. We would like to stream the response to our app and thought we could use the AsyncLLMEngine for that. The LLM class wraps the synchronous LLMEngine and injects the JSON schema thanks to a recent PR, but there is no equivalent wrapper for the async engine, so we cannot supply a schema when using the plain AsyncLLMEngine class. Is there a plan to add such a wrapper (an AsyncLLM class) for the async engine? We would appreciate any suggestions. Thank you.
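For reference, this is a minimal sketch of the streaming flow we have in mind with the bare AsyncLLMEngine. The model name, schema, and request id are placeholders; the commented-out step is where the LLM wrapper would normally inject the JSON schema, which is exactly the hook that seems to be missing on the async side:

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Hypothetical schema we want the model's output to conform to.
answer_schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

async def stream_response(prompt: str) -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="mistralai/Mistral-7B-Instruct-v0.2")
    )

    # The LLM wrapper turns a JSON schema into guided-decoding constraints
    # before sampling; with the plain AsyncLLMEngine there is no public
    # parameter to pass `answer_schema`, which is the gap described above.
    params = SamplingParams(temperature=0.0, max_tokens=256)

    # generate() yields cumulative RequestOutput objects as tokens arrive,
    # which is what would let us stream partial responses to the app.
    async for output in engine.generate(prompt, params, request_id="req-0"):
        print(output.outputs[0].text)

asyncio.run(stream_response("Answer in JSON: what is the capital of France?"))
```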
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.