vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Unhandled tool invocations with Llama 3.1 using LangChain and OpenAI-compatible API #7223

dolanp83 commented 1 month ago

Your current environment

Using the vllm/vllm-openai:v0.5.3.post1 Docker image; the following was executed within the container:

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-1008-nvidia-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S
GPU 2: NVIDIA L40S
GPU 3: NVIDIA L40S

Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
Address sizes:                        46 bits physical, 57 bits virtual
CPU(s):                               128
On-line CPU(s) list:                  0-127
Thread(s) per core:                   2
Core(s) per socket:                   32
Socket(s):                            2
NUMA node(s):                         2
Vendor ID:                            GenuineIntel
CPU family:                           6
Model:                                143
Model name:                           Intel(R) Xeon(R) Gold 6448Y
Stepping:                             8
CPU MHz:                              799.149
CPU max MHz:                          4100.0000
CPU min MHz:                          800.0000
BogoMIPS:                             4200.00
Virtualization:                       VT-x
L1d cache:                            3 MiB
L1i cache:                            2 MiB
L2 cache:                             128 MiB
L3 cache:                             120 MiB
NUMA node0 CPU(s):                    0-31,64-95
NUMA node1 CPU(s):                    32-63,96-127
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:             Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Vulnerable; BHI: Vulnerable (Syscall hardening enabled)
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hfi vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] flashinfer==0.0.9+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.0.3
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  PIX SYS SYS 0-31,64-95  0       N/A
GPU1    PIX  X  SYS SYS 0-31,64-95  0       N/A
GPU2    SYS SYS  X  PIX 32-63,96-127    1       N/A
GPU3    SYS SYS PIX  X  32-63,96-127    1       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I have been having a lot of trouble getting vLLM's OpenAI-compatible API to work with various LangChain/LangGraph tools, so I distilled the problem down to a contrived example. When a tool call is supposed to happen, the response from vLLM contains only the raw `<|python_tag|>` output from Llama 3.1, and the tool is never actually invoked.

from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder, SystemMessagePromptTemplate, \
    HumanMessagePromptTemplate
from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

@tool("multiply-tool")
def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    print(f"multiply() invoked with a={a}, b={b}")
    return a * b

# Hosted OpenAI (works, see output below):
# llm = ChatOpenAI(
#     model="gpt-4o-mini",
#     openai_api_key="<secret>",
#     temperature=0
# )
# vLLM OpenAI-compatible endpoint (the failing case):
llm = ChatOpenAI(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    openai_api_key="EMPTY",
    openai_api_base="http://<server>:8000/v1",
    temperature=0
)
# Ollama (works, see output below):
# llm = ChatOllama(model="llama3.1:70b", temperature=0, base_url="http://<server>:11434")

system_messages = """
You are an AI assistant that can multiply two numbers together.\n
You have access to a tool called 'multiply-tool' which you will use to multiply numbers.
"""

prompt = ChatPromptTemplate.from_messages(
        [
            SystemMessagePromptTemplate.from_template(system_messages),
            HumanMessagePromptTemplate.from_template("{input}"),
            MessagesPlaceholder(variable_name="agent_scratchpad"),
        ]
)
tools = [multiply]
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

print("Asking question...")
input = {
    "input": "what is 6 times 8?"
}
response = agent_executor.invoke(input)
print(response)

Here's the output when this is run with OpenAI:

> Entering new AgentExecutor chain...

Invoking: `multiply-tool` with `{'a': 6, 'b': 8}`

multiply() invoked with a=6, b=8
48
6 times 8 is 48.

> Finished chain.
{'input': 'what is 6 times 8?', 'output': '6 times 8 is 48.'}

Here is the output when this is run with Ollama:

Asking question...

> Entering new AgentExecutor chain...

Invoking: `multiply-tool` with `{'a': 6, 'b': 8}`

multiply() invoked with a=6, b=8
48
The result of multiplying 6 and 8 is 48.

And here is the output when used with vLLM:

Asking question...

> Entering new AgentExecutor chain...
<|python_tag|>{"name": "multiply-tool", "parameters": {"a": "6", "b": "8"}}

> Finished chain.
{'input': 'what is 6 times 8?', 'output': '<|python_tag|>{"name": "multiply-tool", "parameters": {"a": "6", "b": "8"}}'}

Does something else need to be added to the workflow to get this to work with vLLM and Llama 3.1? Thanks!
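For reference, the same behavior can presumably be reproduced without LangChain by sending a tool-enabled request straight to the endpoint. A minimal sketch using the openai client follows; the tool schema mirrors the multiply tool above, and the expected-vs-observed comment is my reading of the outputs shown earlier, not verified separately:

# Minimal sketch (assumption: reproduces the same behavior without LangChain).
from openai import OpenAI

client = OpenAI(base_url="http://<server>:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "multiply-tool",
        "description": "Multiply two numbers.",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "what is 6 times 8?"}],
    tools=tools,
)
message = response.choices[0].message
# Expected: message.tool_calls is populated.
# Observed (per the runs above, assumption): tool_calls is empty and
# message.content holds the raw '<|python_tag|>{...}' string instead.
print(message.tool_calls)
print(message.content)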

mohit-zangoh commented 1 month ago

+1

iqinning commented 1 month ago

+1

ex10ded commented 2 weeks ago

A very ugly, non-production hack I've had some success with is

import json
from typing import Any, List

from langchain_core.messages import AIMessage, ToolCall
from langchain_core.runnables import RunnableLambda
from langchain_core.tools import StructuredTool, tool
from langchain_openai import ChatOpenAI

# Take string parameters, since Llama 3.1 emits the arguments as strings
# (e.g. {"a": "6", "b": "8"}).
@tool
def multiply(a: str, b: str) -> int:
    """Multiply two numbers."""
    print(f"multiply() invoked with a={a}, b={b}")
    return int(a) * int(b)

# tools: List[StructuredTool] = [get_entity, invoke_home_assistant_service, brave_search, multiply]
tools: List[StructuredTool] = [multiply]

model = ChatOpenAI(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    # temperature=0.75,
    # num_ctx=16 * 1024,
    base_url="http://localhost:8000/v1",
    openai_api_key="sk-1234567890",
    verbose=True,
)

llm_with_tools = model.bind_tools(tools)

def parse_tool_call(x: Any):
    """Recover a tool call from the raw <|python_tag|> JSON that vLLM returns."""
    if isinstance(x, AIMessage):
        if x.content.startswith("<|python_tag|>"):
            x.content = x.content.replace("<|python_tag|>", "")
        try:
            parsed_call = json.loads(x.content)
            if parsed_call.get("name") is not None and parsed_call.get("parameters") is not None:
                x.tool_calls = [ToolCall(name=parsed_call["name"], args=parsed_call["parameters"], id=x.id)]
        except json.JSONDecodeError:
            pass
    return x

chain = llm_with_tools | RunnableLambda(parse_tool_call)
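
For completeness, a minimal driver around the chain might look like the sketch below (hypothetical; tools_by_name and run are illustrative names, not the exact code that produced the output that follows):

# Illustrative driver (hypothetical; not the exact code behind the debug output).
from langchain_core.messages import HumanMessage, ToolMessage

tools_by_name = {t.name: t for t in tools}

def run(question: str) -> str:
    # Ask the model; parse_tool_call upgrades the <|python_tag|> JSON into tool_calls.
    messages = [HumanMessage(content=question)]
    ai_msg = chain.invoke(messages)
    print(f"debug: {ai_msg.content}")
    messages.append(ai_msg)
    # Execute each recovered tool call and feed the result back for a final answer.
    for call in ai_msg.tool_calls:
        print(f"debug: Function call: {call}")
        result = tools_by_name[call["name"]].invoke(call["args"])
        print(f"debug: Function result: {result}")
        messages.append(ToolMessage(content=str(result), tool_call_id=call["id"]))
    return chain.invoke(messages).content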

and then when running

You: what is 6 times 8
debug: {"name": "multiply", "parameters": {"a": "6", "b": "8"}}
debug: Function call: {'name': 'multiply', 'args': {'a': '6', 'b': '8'}, 'id': 'run-11debe90-b2bf-4067-a006-e59b43da8606-0'}
debug: multiply() invoked with a=6, b=8
debug: Function result: 48
AI: 48, I hope that is correct, sir. Shall I be of further assistance?