huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Tool calling performs worse on v2.2.0 compared to latest #2413

Open varad0309 opened 1 month ago

varad0309 commented 1 month ago

System Info

gpu=0
num_gpus=1
model=meta-llama/Meta-Llama-3.1-8B-Instruct
# $token, $volume, $max_concurrent_request, $max_total_token,
# $max_input_length, and $wsr must be set in the environment before running
docker run -d \
  --gpus "\"device=$gpu\"" \
  --shm-size 16g \
  -e HUGGING_FACE_HUB_TOKEN=$token \
  -p 8082:80 \
  -v $volume:/data \
  --name Meta-Llama-3.1-8B \
  ghcr.io/huggingface/text-generation-inference:sha-1cebccc@sha256:4ccb775aaaefc90df10b2de7ce17a1f00a07682c12ea9630e6e6fdfa10a1c05e \
  --model-id $model \
  --max-concurrent-requests $max_concurrent_request \
  --max-total-tokens $max_total_token \
  --max-input-length $max_input_length \
  --waiting-served-ratio $wsr \
  --num-shard $num_gpus \
  --dtype bfloat16

OS: Ubuntu Linux
Model: meta-llama/Meta-Llama-3.1-8B-Instruct / meta-llama/Meta-Llama-3-8B-Instruct
Hardware: A100 80G
Version with issue: v2.2.0
Compared with: latest

Reproduction

  1. Launch the docker instance.
  2. Run the following:
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8082/v1",
    api_key="_",
)

# messages, tools, and max_tokens are defined per benchmark example;
# illustrative placeholders are shown after this snippet
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    max_tokens=max_tokens,
)

predictions = chat_completion.choices[0].message.tool_calls
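
For completeness, the messages and tools payloads follow the standard OpenAI function-calling schema. The values below are illustrative placeholders, not the actual benchmark data:

messages = [
    {"role": "user", "content": "Find me a hotel in Paris from 2022-05-01 to 2022-05-10."}
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_hotel",
            "description": "Search for hotels at a destination",
            "parameters": {
                "type": "object",
                "properties": {
                    "destination": {"type": "string"},
                    "check_in_date": {"type": "string"},
                    "check_out_date": {"type": "string"},
                },
                "required": ["destination", "check_in_date", "check_out_date"],
            },
        },
    }
]
max_tokens = 256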

Expected behavior

Hey @drbh @ErikKaum, did you try benchmarking the performance of v2.2.0 against latest on tool calling? I am getting dramatically worse performance on v2.2.0 compared to previous versions on some tool-call benchmarks I have created. Just changing the version causes the performance of meta-llama/Meta-Llama-3-8B-Instruct to drop from 0.66 to 0.08 on the same script and data. Of course, I can't evaluate Llama-3.1 on previous versions, but its performance is similarly close to 0.

drbh commented 1 month ago

Hi @varad0309, thanks for opening this issue. v2.2.0 was released ~3 weeks ago, and TGI has since had some bug fixes and improvements that are available on latest. Specifically, a fix for a tool-related bug was merged yesterday (https://github.com/huggingface/text-generation-inference/pull/2406), and that would likely improve tool calling responses.

We will be publishing a newer release in the coming week/weeks and it should include these fixes along with many other improvements! For now I'd recommend using latest or a pinned commit to ensure you are using a version with the tool fixes. Thanks again!

varad0309 commented 1 month ago

@drbh thanks for the quick reply. I did try a commit from a few hours ago (more specifically, sha-1cebccc@sha256:4ccb775aaaefc90df10b2de7ce17a1f00a07682c12ea9630e6e6fdfa10a1c05e). The problem still persists.

My observation: the list of available tools still does not appear to be passed to the model appropriately.
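
A quick client-side way to check this is to send a single unambiguous tool and see whether the model selects it. A minimal sketch reusing the client from the reproduction script above (the tool and prompt here are made up for illustration):

# one tool, one prompt that should obviously trigger it
probe = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "Roll a six-sided die once."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "roll_dice",
            "description": "Roll one or more dice",
            "parameters": {
                "type": "object",
                "properties": {
                    "sides": {"type": "integer"},
                    "quantity": {"type": "integer"},
                },
                "required": ["sides", "quantity"],
            },
        },
    }],
    tool_choice="auto",
    max_tokens=64,
)
# if tools are passed through correctly, the returned call should name roll_dice
print(probe.choices[0].message.tool_calls)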

drbh commented 1 month ago

Oh, apologies, I must have misunderstood the issue. It sounds like tool responses have regressed starting at version 2.2.0 and onward? Would you be able to share an example of the input and expected output? Additionally, do you know when the tools were last working as you expected (a version, or best case the last commit sha)? Thanks!

varad0309 commented 1 month ago

Sure, here are a few examples. Unfortunately, I don't know which version introduced the regression. The two versions I am comparing, via docker images, are:

  1. (this works) version 1 => ghcr.io/huggingface/text-generation-inference:latest
  2. (this doesn't) version 2 => ghcr.io/huggingface/text-generation-inference:sha-1cebccc@sha256:4ccb775aaaefc90df10b2de7ce17a1f00a07682c12ea9630e6e6fdfa10a1c05e

Examples:

Ground truth: [{'name': 'search_hotel', 'arguments': {'destination': 'Paris', 'check_in_date': '2022-05-01', 'check_out_date': '2022-05-10'}}]
Version 1: [Function(arguments={'check_in_date': '2022-05-01', 'check_out_date': '2022-05-10', 'location': 'Paris', 'num_guests': 1, 'num_rooms': 1}, name='search_hotel', description=None)]
Version 2: [Function(arguments={'number': 7}, name='find_hotels', description=None)]

Ground truth: [{'name': 'roll_dice', 'arguments': {'sides': 6, 'quantity': 1}}]
Version 1: [Function(arguments={'quantity': 1, 'sides': 6}, name='roll_dice', description=None)]
Version 2: [Function(arguments={'artist': 'tools', 'genre': 'RNG Tools'}, name='random.randint', description=None)]

Ground truth: [{'name': 'calculate_fuel_cost', 'arguments': {'distance': 500, 'fuel_price': 1.2, 'fuel_efficiency': 10}}]
Version 1: [Function(arguments={'distance': 500, 'fuel_efficiency': 10, 'fuel_price': 1.2}, name='calculate_fuel_cost', description=None)]
Version 2: [Function(arguments={'distance': 500, 'fuel_efficiency': 10, 'fuel_price': 1.2}, name='calculate_fuel_consumption', description=None)]
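
Scoring here is presumably an exact match on function name and arguments, as the examples above suggest. A rough sketch of such a check (hypothetical helper, not the actual benchmark code; predicted is a list of Function objects as printed above):

import json

def is_correct(predicted, ground_truth):
    # exact match: same number of calls, same names, same arguments
    if predicted is None or len(predicted) != len(ground_truth):
        return False
    for pred, gt in zip(predicted, ground_truth):
        args = pred.arguments
        if isinstance(args, str):  # some client versions return arguments as a JSON string
            args = json.loads(args)
        if pred.name != gt["name"] or args != gt["arguments"]:
            return False
    return True
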
drbh commented 3 weeks ago

Hi @varad0309 I believe these issues should be resolved by the recent improvements/bug fixes to grammars and tool calling (https://github.com/huggingface/text-generation-inference/pull/2463, https://github.com/huggingface/text-generation-inference/pull/2454, https://github.com/huggingface/text-generation-inference/pull/2391, etc...)

Would you kindly try the most recent container image ghcr.io/huggingface/text-generation-inference:sha-8f99f16? There were some changes directly related to the performance of meta-llama/Meta-Llama-3-8B-Instruct's tools so I believe this should improve for your use case. Thank you!

varad0309 commented 3 weeks ago

Turn         Tool type    sha-1cebccc   sha-21187c2   sha-8f99f16
Single turn  Irrelevant   0.02          0.27          0.27
Single turn  Chat         0.64          0.28          0.28
Multi turn   Irrelevant   0.0           0.18          0.18
Multi turn   Chat         0.08          0.04          0.04

@drbh thanks for working on this!! I just ran some tests across versions (older to newer from the first sha column to the third) on a benchmark, setting temperature to 0 and using the above OpenAI chat completion script with meta-llama/Meta-Llama-3-8B-Instruct.

Function calling performance still seems to be dropping, though the model's ability to filter out irrelevant tools is now pretty good (the benchmark is BFCL-style).
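
For completeness, the only change from the earlier reproduction script is pinning temperature to 0 so runs are comparable across versions:

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    max_tokens=max_tokens,
    temperature=0,  # greedy decoding for reproducible comparisons
)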