PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.31
Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.0-30-cloud-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.2.128
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2200.196
BogoMIPS: 4400.39
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64 KiB
L1i cache: 64 KiB
L2 cache: 2 MiB
L3 cache: 38.5 MiB
NUMA node0 CPU(s): 0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonsto
p_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsba
se tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu121torch2.3
[pip3] numpy==1.25.2
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] flashinfer 0.0.8+cu121torch2.3 pypi_0 pypi
[conda] numpy 1.25.2 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchvision 0.18.0 pypi_0 pypi
[conda] transformers 4.42.3 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-3 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
Whenever I try to make a request with guided decoding with Phi-3-small, the request fails. Even using the OpenAI client object with response_format={'type': 'json_object'} fails. Here is an example prompt that crashes for me:
curl http:/localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-3-small-8k-instruct",
"prompt": "San Francisco is a ",
"max_tokens": 7,
"temperature": 0,
"guided_json": {
"type": "string",
"enum": ["city", "state"]
}
}'
Traceback:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 147, in create_completion
generator = await openai_serving_completion.create_completion(
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_completion.py", line 109, in create_completion
await get_guided_decoding_logits_processor(
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/__init__.py", line 20, in get_guided_decoding_logits_processor
return await get_outlines_guided_decoding_logits_processor(
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/outlines_decoding.py", line 75, in get_outlines_guided_decoding_logits_processor
return await loop.run_in_executor(global_thread_pool,
File "/opt/conda/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/outlines_decoding.py", line 117, in _get_logits_processor
return JSONLogitsProcessor(guide, tokenizer, whitespace_pattern)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 123, in __init__
super().__init__(regex_string, tokenizer)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 88, in __init__
RegexLogitsProcessor._get_guide(regex_string, tokenizer))
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 74, in _get_guide
return RegexGuide(regex_string, tokenizer)
File "/opt/conda/lib/python3.10/site-packages/outlines/fsm/guide.py", line 145, in __init__
) = create_states_mapping(regex_string, tokenizer)
File "/opt/conda/lib/python3.10/site-packages/outlines/caching.py", line 119, in wrapper
result = wrapper.__memory__.get(cache_key, default=ENOVAL, retry=True)
File "/opt/conda/lib/python3.10/site-packages/diskcache/core.py", line 1149, in get
db_key, raw = self._disk.put(key)
File "/opt/conda/lib/python3.10/site-packages/outlines/caching.py", line 20, in put
data = cloudpickle.dumps(key)
File "/opt/conda/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1479, in dumps
cp.dump(obj)
File "/opt/conda/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1245, in dump
return super().dump(obj)
TypeError: cannot pickle '_thread.RLock' object```
Thank you.
Your current environment
🐛 Describe the bug
Whenever I try to make a request with guided decoding with Phi-3-small, the request fails. Even using the OpenAI client object with
response_format={'type': 'json_object'}
fails. Here is an example prompt that crashes for me:Traceback: