varad0309 opened 1 month ago
Hi @varad0309, thanks for opening this issue. v2.2.0
was released ~3 weeks ago, and TGI has since had bug fixes and improvements that are available on latest. Specifically, a fix for a tool-related bug was merged yesterday (https://github.com/huggingface/text-generation-inference/pull/2406), which should improve tool calling responses.
We will be publishing a new release in the coming weeks, and it should include these fixes along with many other improvements! For now I'd recommend using latest or a pinned commit to ensure you are running a version with the tool fixes. Thanks again!
@drbh thanks for the quick reply. I did try a commit from a few hours ago (specifically, sha-1cebccc@sha256:4ccb775aaaefc90df10b2de7ce17a1f00a07682c12ea9630e6e6fdfa10a1c05e). The problem still persists.
My observation: the list of available tools is still not being passed appropriately.
Oh, apologies, I must have misunderstood the issue; it sounds like tool responses have regressed starting at version 2.2.0 and onwards? Would you be able to share an example of the input and expected output? Additionally, do you know when the tools last worked as you expected (maybe a version, or best case the last commit sha)? Thanks!
Sure, here are a few examples. I unfortunately don't know the exact version at which it started breaking. The versions I am comparing are via docker images:
ghcr.io/huggingface/text-generation-inference:latest
ghcr.io/huggingface/text-generation-inference:sha-1cebccc@sha256:4ccb775aaaefc90df10b2de7ce17a1f00a07682c12ea9630e6e6fdfa10a1c05e
Examples:
Ground truth: [{'name': 'search_hotel', 'arguments': {'destination': 'Paris', 'check_in_date': '2022-05-01', 'check_out_date': '2022-05-10'}}]
Version 1: [Function(arguments={'check_in_date': '2022-05-01', 'check_out_date': '2022-05-10', 'location': 'Paris', 'num_guests': 1, 'num_rooms': 1}, name='search_hotel', description=None)]
Version 2: [Function(arguments={'number': 7}, name='find_hotels', description=None)]
Ground truth: [{'name': 'roll_dice', 'arguments': {'sides': 6, 'quantity': 1}}]
Version 1: [Function(arguments={'quantity': 1, 'sides': 6}, name='roll_dice', description=None)]
Version 2: [Function(arguments={'artist': 'tools', 'genre': 'RNG Tools'}, name='random.randint', description=None)]
Ground truth: [{'name': 'calculate_fuel_cost', 'arguments': {'distance': 500, 'fuel_price': 1.2, 'fuel_efficiency': 10}}]
Version 1: [Function(arguments={'distance': 500, 'fuel_efficiency': 10, 'fuel_price': 1.2}, name='calculate_fuel_cost', description=None)]
Version 2: [Function(arguments={'distance': 500, 'fuel_efficiency': 10, 'fuel_price': 1.2}, name='calculate_fuel_consumption', description=None)]
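For reference, the scoring in these comparisons is a simple exact match on function name and arguments. The helper below is an illustrative sketch of that check, not the actual benchmark code:

```python
def tool_call_matches(expected: dict, predicted: dict) -> bool:
    """Exact-match check on function name and arguments.

    Dict equality in Python ignores key order, so argument ordering
    differences (as in the Version 1 outputs above) do not count as errors.
    """
    return (
        predicted.get("name") == expected["name"]
        and predicted.get("arguments") == expected["arguments"]
    )

expected = {"name": "roll_dice", "arguments": {"sides": 6, "quantity": 1}}

# Version 1 output, reshaped into a plain dict: same call, reordered arguments
v1 = {"name": "roll_dice", "arguments": {"quantity": 1, "sides": 6}}
# Version 2 output: wrong function name and unrelated arguments
v2 = {"name": "random.randint", "arguments": {"artist": "tools", "genre": "RNG Tools"}}

print(tool_call_matches(expected, v1))  # True
print(tool_call_matches(expected, v2))  # False
```

Under this kind of check, the Version 2 outputs above all score as failures, which is consistent with the scores dropping toward 0.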
Hi @varad0309 I believe these issues should be resolved by the recent improvements/bug fixes to grammars and tool calling (https://github.com/huggingface/text-generation-inference/pull/2463, https://github.com/huggingface/text-generation-inference/pull/2454, https://github.com/huggingface/text-generation-inference/pull/2391, etc.).
Would you kindly try the most recent container image ghcr.io/huggingface/text-generation-inference:sha-8f99f16? There were some changes directly related to the performance of meta-llama/Meta-Llama-3-8B-Instruct's tools, so I believe this should improve for your use case. Thank you!
@drbh thanks for working on this!! I just ran some tests on different versions (older to newer as you go from column 1 to column 3) on a benchmark, setting temperature to 0 and using the above OpenAI chat completion script on meta-llama/Meta-Llama-3-8B-Instruct.

Turn / Tool Type | sha-1cebccc | sha-21187c2 | sha-8f99f16
---|---|---|---
Single turn Irrelevant | 0.02 | 0.27 | 0.27
Single turn Chat | 0.64 | 0.28 | 0.28
Multi turn Irrelevant | 0.0 | 0.18 | 0.18
Multi turn Chat | 0.08 | 0.04 | 0.04

The function-calling performance still seems to be dropping, though the model's ability to filter out irrelevant tools is pretty good (the benchmark is BFCL-style).
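A request in this style can be sketched as below. This is a minimal, hypothetical payload, assuming a TGI server on localhost:8080 exposing the OpenAI-compatible /v1/chat/completions route; the tool schema is illustrative and not the benchmark's actual script:

```python
import json

# Hypothetical tool schema in the OpenAI function-calling format
tools = [
    {
        "type": "function",
        "function": {
            "name": "roll_dice",
            "description": "Roll one or more dice.",
            "parameters": {
                "type": "object",
                "properties": {
                    "sides": {"type": "integer"},
                    "quantity": {"type": "integer"},
                },
                "required": ["sides", "quantity"],
            },
        },
    }
]

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Roll a six-sided die once."}],
    "tools": tools,
    "tool_choice": "auto",
    "temperature": 0,  # deterministic decoding, matching the benchmark runs
}

print(json.dumps(payload, indent=2))

# With a running TGI server, this payload would be sent via the openai client:
# client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="-")
# resp = client.chat.completions.create(**payload)
# tool_calls = resp.choices[0].message.tool_calls
```

The reported scores come from comparing the returned tool calls against the ground-truth name/arguments pairs listed earlier in the thread.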
System Info
OS: Ubuntu Linux
Model: meta-llama/Meta-Llama-3.1-8B-Instruct / meta-llama/Meta-Llama-3-8B-Instruct
Hardware: A100 80G
Version with issue: v2.2.0
Compared with: latest

Information

Tasks

Reproduction

Expected behavior
Hey @drbh @ErikKaum, did you try benchmarking the performance of v2.2.0 against latest on tool calling? I am getting dramatically worse performance on v2.2.0 compared to previous versions on some tool-call benchmarks I have created. Just changing the version causes the performance of meta-llama/Meta-Llama-3-8B-Instruct to drop from 0.66 to 0.08 on the same script and data. Of course, I can't evaluate the performance of Llama-3.1 on previous versions, but its performance is similarly close to 0.