vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: OpenAI API request doesn't go through with 'guided_json' #4439

Open Tejaswgupta opened 6 months ago

Tejaswgupta commented 6 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.2 (main, Sep 24 2023, 00:07:45) [Clang 15.0.0 (clang-1500.0.38.1)] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.3
[pip3] onnx==1.15.0
[pip3] onnxruntime==1.17.3
[pip3] torch==2.3.0
[pip3] torchaudio==2.1.1
[pip3] torchvision==0.16.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

🐛 Describe the bug

Calling the model without guided_json works fine and returns a response quickly; however, since we want structured output, relying on the prompt alone is hit or miss. As soon as I pass the Pydantic model's schema in guided_json, the script just hangs and no request is received on the server.

You can try the following code (guided_json commented out); a stripped-down repro that isolates guided_json follows the full script.

import json
import re

from pydantic import BaseModel

# `client` (an OpenAI client pointed at the vLLM server), `model_name`,
# `all_splits`, and `metadata` are defined earlier in the script.

class RFPModel(BaseModel):
    rfp_title: str
    issuing_agency: str
    rfp_release_date: str
    submission_deadline: str
    primary_point_of_contact: str
    contract_type: str = None
    industry_sector: str = None
    eligibility_criteria: str = None
    scope_of_work_services_required: str = None
    budget_or_funding_amount: str = None
    place_of_performance: str = None
    procurement_method: str = None
    award_criteria: str = None
    additional_requirements: str = None

for c in all_splits:
    ex_prompt = f'''Analyze the text excerpt from a government Request for Proposal (RFP) provided below and extract the relevant metadata according to the predefined fields. Format the extracted information as a JSON object, where each field is represented as a key with its corresponding value. If a certain piece of information is not present in the current text chunk, but was included in previous chunks, incorporate that information as well. Use 'null' for any fields where data is not available in the current or previous chunks.

Extracted Metadata up until the current chunk:

{metadata}

Text Excerpt:

{c}

Please update the JSON object with the following structure, filling in each field with the extracted information or 'null' if the information is not available:

{{
  "RFP Title": "[Extracted RFP Title]",
  "Issuing Agency": "[Extracted Issuing Agency]",
  "RFP Release Date": "[Extracted Release Date]",
  "Submission Deadline": "[Extracted Submission Deadline]",
  "Primary Point of Contact (POC)": "[Extracted POC]",
  "Contract Type": "[Extracted Contract Type]",
  "Industry Sector": "[Extracted Industry Sector]",
  "Eligibility Criteria": "[Extracted Eligibility Criteria]",
  "Scope of Work/Services Required": "[Extracted Scope of Work]",
  "Budget or Funding Amount": "[Extracted Budget]",
  "Place of Performance": "[Extracted Place of Performance]",
  "Procurement Method": "[Extracted Procurement Method]",
  "Award Criteria": "[Extracted Award Criteria]",
  "Additional Requirements": "[Extracted Additional Requirements]"
}}

Ensure that the JSON keys are consistent with the metadata fields, and the values are accurately extracted from the RFP text. If the text chunk implies details that may relate to these fields without directly stating them, use inference to populate the fields appropriately. Your response should only be in English. Extracted JSON:'''

    out = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are an AI assistant that helps people extract relevant information from RFPs and structure it in JSON format. You should always respond in an accurate and honest manner.",
            },
            {"role": "user", "content": ex_prompt},
        ],
        # Passing the schema here is what makes the request hang:
        # extra_body={
        #     'guided_json': RFPModel.model_json_schema(),
        # },
        model=model_name,
        temperature=0.1,
    )

    # Pull the JSON object out of the free-form completion and carry it into the next chunk.
    json_string = re.search(
        r'\{.*\}', out.choices[0].message.content, re.DOTALL).group()
    json_data = json.loads(json_string)
    print(json_data)
    metadata.append(json_data)
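
For isolation, here is a stripped-down sketch that exercises only the guided_json parameter. The base URL, API key, and model name are placeholders for a local vLLM OpenAI-compatible server, not the exact values from my setup:

from openai import OpenAI
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

# Placeholder endpoint and credentials; point these at your own vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

out = client.chat.completions.create(
    model="my-served-model",  # placeholder model name
    messages=[{"role": "user", "content": "Return a person named Alice, age 30, as JSON."}],
    temperature=0.1,
    # With this extra_body the request hangs as described above; without it the call returns normally.
    extra_body={"guided_json": Person.model_json_schema()},
)
print(out.choices[0].message.content)
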
wushixong commented 6 months ago

I also encountered the same issue; when I loaded the model locally, vLLM became extremely slow. Here is my code:

from outlines.integrations.vllm import JSONLogitsProcessor
from pydantic import BaseModel
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen1.5-32B-Chat-GPTQ-Int4",
          dtype='float16', quantization='gptq', max_model_len=32768,
          gpu_memory_utilization=0.9)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-32B-Chat-GPTQ-Int4")
default_sampling_params = SamplingParams(temperature=0, max_tokens=1000, logits_processors=[])

# `messages` is the chat history built earlier in my script.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

class Person(BaseModel):
    name: str = ''
    age: int = 0

# Constrain generation to the Person JSON schema.
default_sampling_params.logits_processors.append(JSONLogitsProcessor(llm=llm, schema=Person))

result = llm.generate([prompt], default_sampling_params, use_tqdm=False)

and my vllm version is 0.4.0

hmellor commented 5 months ago

Is it any faster with --guided-decoding-backend lm-format-enforcer?
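
For reference, that would mean launching the server with the alternative backend, or (on versions that support per-request overrides) selecting the backend per request. A rough sketch, reusing the names from the snippet in the issue:

# Server launch (model name is a placeholder):
#   python -m vllm.entrypoints.openai.api_server \
#       --model <your-model> \
#       --guided-decoding-backend lm-format-enforcer

# Per-request override, if your vLLM version supports it:
out = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": ex_prompt}],
    temperature=0.1,
    extra_body={
        "guided_json": RFPModel.model_json_schema(),
        "guided_decoding_backend": "lm-format-enforcer",
    },
)
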

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!