`tool_calls` gives sporadic EoF parsing errors

System Info

Privately hosted instance of TGI Version: 2.1.1

Deployed as a standalone kserve predictor Model: Mixtral-8x7b-instruct GPU: single A100

2024-07-16T16:27:39.016656Z  INFO text_generation_launcher: Args {
    model_id: "/mnt/models",
    revision: None,
    validation_workers: 2,
    sharded: Some(
        false,
    ),
    num_shard: None,
    quantize: Some(
        Awq,
    ),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 160,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        4095,
    ),
    max_total_tokens: Some(
        4096,
    ),
    waiting_served_ratio: 1.2,
    max_batch_prefill_tokens: Some(
        4296,
    ),
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: Some(
        [
            0,
        ],
    ),
hostname: "mixtral-8x7b-instruct-grammar-predictor-00002-deployment-b7gg29",
    port: 8080,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
}
2024-07-16T16:27:39.016784Z  INFO download: text_generation_launcher: Starting check and download process for /mnt/models
2024-07-16T16:27:40.084668Z  INFO text_generation_launcher: Detected system cuda
2024-07-16T16:27:41.298983Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-07-16T16:27:41.718951Z  INFO download: text_generation_launcher: Successfully downloaded weights for /mnt/models
2024-07-16T16:27:41.719189Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-16T16:27:42.916722Z  INFO text_generation_launcher: Detected system cuda
2024-07-16T16:27:49.011442Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-07-16T16:27:49.024694Z  INFO shard-manager: text_generation_launcher: Shard ready in 7.304671588s rank=0
2024-07-16T16:27:49.123620Z  INFO text_generation_launcher: Starting Webserver
2024-07-16T16:27:49.193399Z  INFO text_generation_router: router/src/main.rs:330: Overriding LlamaTokenizer with TemplateProcessing to follow python override defined in https://github.com/huggingface/transformers/blob/4aa17d00690b7f82c95bb2949ea57e22c35b4336/src/transformers/models/llama/tokenization_llama_fast.py#L203-L205
2024-07-16T16:27:49.193431Z  INFO text_generation_router: router/src/main.rs:345: Using config Some(Mixtral)
2024-07-16T16:27:49.193435Z  WARN text_generation_router: router/src/main.rs:354: no pipeline tag found for model /mnt/models
2024-07-16T16:27:49.193437Z  WARN text_generation_router: router/src/main.rs:372: Invalid hostname, defaulting to 0.0.0.0
2024-07-16T16:27:49.195599Z  INFO text_generation_router::server: router/src/server.rs:1567: Warming up model
2024-07-16T16:27:56.267189Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [0]
2024-07-16T16:27:56.267554Z  INFO text_generation_router::server: router/src/server.rs:1594: Using scheduler V3
2024-07-16T16:27:56.267571Z  INFO text_generation_router::server: router/src/server.rs:1646: Setting max batch total tokens to 394304
2024-07-16T16:27:56.279154Z  INFO text_generation_router::server: router/src/server.rs:1884: Connected

Information

[X] Docker
[ ] The CLI directly

Tasks

[X] An officially supported command
[ ] My own modifications

Reproduction

Steps to reproduce:

Send payload to TGI via http POST request (/v1/chat/completions):

'model': 'tgi', 'messages': [{'role': 'user', 'content': "<s>[INST] I'm working on a report about a basketball player's average performance throughout the season. The data I have includes the points they scored in each game: 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160. To complete my analysis, I need to calculate the mean score per game. Can you help me with that? [/INST]"}], 'temperature': 0.001, 'tools': [{'type': 'function', 'function': {'name': 'calculate_mean', 'description': 'Calculates the mean of a list of numbers.', 'parameters': {'type': 'dict', 'properties': {'numbers': {'type': 'array', 'items': {'type': 'number'}, 'description': 'The list of numbers.'}}, 'required': ['numbers']}}}], 'tool_choice': 'calculate_mean'}

headers={"Content-type": "application/json"}

Receive the following error: {'error': 'EOF while parsing a list at line 1 column 127', 'error_type': 'Input validation error'}

Some more samples and relevant errors:

Payload: {'model': 'tgi', 'messages': [{'role': 'user', 'content': "\<s>[INST] I'm working on a report about a basketball player's average performance throughout the season. The data I have includes the points they scored in each game: 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160. To complete my analysis, I need to calculate the mean score per game. Can you help me with that? [/INST]"}], 'temperature': 0.001, 'tools': [{'type': 'function', 'function': {'name': 'calculate_mean', 'description': 'Calculates the mean of a list of numbers.', 'parameters': {'type': 'dict', 'properties': {'numbers': {'type': 'array', 'items': {'type': 'number'}, 'description': 'The list of numbers.'}}, 'required': ['numbers']}}}], 'tool_choice': 'calculate_mean'}

Response: {'error': 'EOF while parsing a list at line 1 column 127', 'error_type': 'Input validation error'}

Payload: {'model': 'tgi', 'messages': [{'role': 'user', 'content': "\<s>[INST] I'm working on a project that involves comparing the attributes of different entities to determine how similar they are. I have two entities represented by numerical arrays, and I need to use cosine similarity as a measure of similarity between them. The attributes for the first entity are [0.3, 0.8, 0.1, 0.6, 0.2], and for the second entity, they are [0.5, 0.7, 0.4, 0.9, 0.3]. Could you calculate the cosine similarity for these two vectors for me? [/INST]"}], 'temperature': 0.001, 'tools': [{'type': 'function', 'function': {'name': 'calculate_cosine_similarity', 'description': 'Calculates the cosine similarity of two vectors.', 'parameters': {'type': 'dict', 'properties': {'vectorA': {'type': 'array', 'items': {'type': 'number'}, 'description': 'The first vector.'}, 'vectorB': {'type': 'array', 'items': {'type': 'number'}, 'description': 'The second vector.'}}, 'required': ['vectorA', 'vectorB']}}}], 'tool_choice': 'calculate_cosine_similarity'}

Response: {'error': 'EOF while parsing a value at line 91 column 0', 'error_type': 'Input validation error'}

Payload: {'model': 'tgi', 'messages': [{'role': 'user', 'content': "\<s>[INST] I'm planning a business trip to New York, and I've decided to extend my stay to enjoy the city a bit more. I'd like to book a deluxe room for the duration of my trip. The dates I'm looking at are from August 11, 2024, to August 15, 2024. I've got a budget set aside for accommodation, and I'm willing to spend up to $1000 for a comfortable stay. My customer ID is 123. Could you go ahead and book that room for me? [/INST]"}], 'temperature': 0.001, 'tools': [{'type': 'function', 'function': {'name': 'book_room', 'description': 'Books a room for a customer.', 'parameters': {'type': 'dict', 'properties': {'room_type': {'type': 'string', 'description': 'The room type to book.'}, 'price': {'type': 'number', 'description': 'The max price of the room. Default 0.0'}, 'check_in_date': {'type': 'string', 'description': 'The check-in date in format of MM-DD-YYYY. '}, 'check_out_date': {'type': 'string', 'description': 'The check-out date in format of MM-DD-YYYY.'}, 'customer_id': {'type': 'string', 'description': 'The customer ID.'}, 'discount_code': {'type': 'string', 'description': 'The discount code (if any).', 'default': None}}, 'required': ['room_type', 'check_in_date', 'check_out_date', 'customer_id']}}}], 'tool_choice': 'book_room'}

Error: {'error': 'EOF while parsing a string at line 7 column 112', 'error_type': 'Input validation error'}

Stack trace from server:

2024-07-16T21:50:30.018992Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/interegular/patterns.py", line 486, in parse
    return super(_ParsePattern, self).parse()
  File "/opt/conda/lib/python3.10/site-packages/interegular/utils/simple_parser.py", line 63, in parse
    raise NoMatch(self.data, max(self._expected), self._expected[max(self._expected)])
interegular.utils.simple_parser.NoMatch: Can not match at index 858. Got '))?[\\', expected any of ['*', '+', '?', '{', '*', '+', '?', '{', '(', '[', '\\', '.', '$', '^', "<Any 1 except ('.', '?', '\\\\', '(', ')', '|', '*', '[', '^', '$', '+')>", '|'].
Context(data[-10:+10]): '*"[\\n ]*\\}))?[\\n ]*\\'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
 File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 145, in Prefill
    batch = self.model.batch_type.from_pb(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 442, in from_pb
    return cls.from_tokenized(pb, tokenizer, batch_tokenized_inputs, dtype, device)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 322, in from_tokenized
    next_token_chooser = HeterogeneousNextTokenChooser.from_pb(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/tokens.py", line 486, in from_pb
    return HeterogeneousNextTokenChooser(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/tokens.py", line 284, in __init__
    HeterogeneousGrammarLogitProcessor(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/logits_process.py", line 570, in __init__
    fsm = GrammarLogitProcessor._cached_compile_fsm(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/logits_process.py", line 527, in _cached_compile_fsm
    fsm = RegexFSM(schema, tokenizer)
  File "/opt/conda/lib/python3.10/site-packages/outlines/fsm/fsm.py", line 121, in __init__
    self.states_to_token_maps, self.empty_token_ids = create_states_mapping(
  File "/opt/conda/lib/python3.10/site-packages/outlines/caching.py", line 74, in wrapper
    result = cached_function(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/outlines/fsm/fsm.py", line 102, in create_states_mapping
    regex_pattern = interegular.parse_pattern(regex_string)
  File "/opt/conda/lib/python3.10/site-packages/interegular/patterns.py", line 730, in parse_pattern
    out = p.parse()
  File "/opt/conda/lib/python3.10/site-packages/interegular/utils/simple_parser.py", line 38, in w
    return m(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/interegular/patterns.py", line 488, in parse
    raise InvalidSyntax
interegular.patterns.InvalidSyntax

Expected behavior

Expected behaviour:

The model returns a valid response conforming to the syntax, such as the following (which was generated by the same model, as part of the same dataset, for example:


Payload: {'model': 'tgi', 'messages': [{'role': 'user', 'content': "\<s>[INST] To better understand the volatility and risk associated with this particular stock, I need to calculate the standard deviation of its daily closing prices over the past 10 trading days. Here are the figures I've gathered: 1000, 2000, 3000, 4000, 5000, 7000, 9000, 15000, 20000, and 30000. Can you provide me with the standard deviation for these closing prices? [/INST]"}], 'temperature': 0.001, 'tools': [{'type': 'function', 'function': {'name': 'calculate_standard_deviation', 'description': 'Calculates the standard deviation of a list of numbers.', 'parameters': {'type': 'dict', 'properties': {'numbers': {'type': 'array', 'items': {'type': 'number'}, 'description': 'The list of numbers.'}}, 'required': ['numbers']}}}], 'tool_choice': 'calculate_standard_deviation'}

Response: {'object': 'chat.completion', 'id': '', 'created': 1721167846, 'model': '/mnt/models', 'system_fingerprint': '2.1.1-sha-4dfdb48', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'tool_calls': [{'id': '0', 'type': 'function', 'function': {'description': None, 'name': 'stdev', 'arguments': {'numbers': [1000, 2000, 3000, 4000, 5000, 7000, 9000, 15000, 20000, 30000]}}}]}, 'logprobs': None, 'finish_reason': 'eos_token'}], 'usage': {'prompt_tokens': 141, 'completion_tokens': 94, 'total_tokens': 235}}

Payload: {'model': 'tgi', 'messages': [{'role': 'user', 'content': "\[INST] I've been tracking the scoring performance of a certain basketball player across the last 12 games to get insights into his consistency. The points he scored in each game are as follows: 30, 20, 25, 12, 59, 23, 64, 21, 67, 12, 23, and 43. I need to calculate the standard deviation of this scoring to better understand the variability and predictability of his performance. Could you help me with that? [/INST]"}], 'temperature': 0.001, 'tools': [{'type': 'function', 'function': {'name': 'calculate_standard_deviation', 'description': 'Calculates the standard deviation of a list of numbers.', 'parameters': {'type': 'dict', 'properties': {'numbers': {'type': 'array', 'items': {'type': 'number'}, 'description': 'The list of numbers.'}}, 'required': ['numbers']}}}], 'tool_choice': 'calculate_standard_deviation'}

Response: {'object': 'chat.completion', 'id': '', 'created': 1721167856, 'model': '/mnt/models', 'system_fingerprint': '2.1.1-sha-4dfdb48', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'tool_calls': [{'id': '0', 'type': 'function', 'function': {'description': None, 'name': 'calculateStandardDeviation', 'arguments': {'numbers': [30, 20, 25, 12, 59, 23, 64, 21, 67, 12, 23, 43]}}}]}, 'logprobs': None, 'finish_reason': 'eos_token'}], 'usage': {'prompt_tokens': 135, 'completion_tokens': 72, 'total_tokens': 207}}

Payload: {'model': 'tgi', 'messages': [{'role': 'user', 'content': "\[INST] I'm currently tweaking a machine learning model and I need to understand the similarity between two objects in my dataset. Their characteristics are expressed in the feature vectors [0.5, 0.7, 0.2, 0.9, 0.1] for the first object and [0.4, 0.6, 0.3, 0.8, 0.2] for the second one. Could you calculate the cosine similarity between these two feature vectors to help me determine how similar these objects are? [/INST]"}], 'temperature': 0.001, 'tools': [{'type': 'function', 'function': {'name': 'calculate_cosine_similarity', 'description': 'Calculates the cosine similarity of two vectors.', 'parameters': {'type': 'dict', 'properties': {'vectorA': {'type': 'array', 'items': {'type': 'number'}, 'description': 'The first vector.'}, 'vectorB': {'type': 'array', 'items': {'type': 'number'}, 'description': 'The second vector.'}}, 'required': ['vectorA', 'vectorB']}}}], 'tool_choice': 'calculate_cosine_similarity'}

Response: {'object': 'chat.completion', 'id': '', 'created': 1721167860, 'model': '/mnt/models', 'system_fingerprint': '2.1.1-sha-4dfdb48', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'tool_calls': [{'id': '0', 'type': 'function', 'function': {'description': None, 'name': 'cosine_similarity', 'arguments': {'vectorA': [0.5, 0.7, 0.2, 0.9, 0.1], 'vectorB': [0.4, 0.6, 0.3, 0.8, 0.2]}}}]}, 'logprobs': None, 'finish_reason': 'eos_token'}], 'usage': {'prompt_tokens': 133, 'completion_tokens': 76, 'total_tokens': 209}}

We get 72/100 of the completions from the Berkeley completion dataset, specifically https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/data/gorilla_openfunctions_v1_test_executable_simple.json in a valid format, as above. The other 28 give the error. (We updated the functions slightly, for example replacing "tuple" with "prefixItems" to conform to the json schema, and "float" with "number"). ### Other things that we have tried - Changing the payload format and using the `/generate` endpoint. This works in that we get valid responses that generally conform to the grammar. However, some responses just give whitespace till the token limit, and some others give valid json that only use some of the "required" parameters. - Removing the [INST] and \<s> tags. This gives the same error. - Trying without `tool_choice`. This works, but just generated a regular chat response with no JSON. Setting this either to the tool name or "auto" results in the error above. Trying without both also just results in a valid, regular chat response. - Manually calling the outlines parser on the problematic samples using

regex = outlines.fsm.json_schema.build_regex_from_schema(inp_json) interegular.parse_pattern(regex)

This parser seems to throw an error despite the schema being valid according to jsonschema validators, until we remove the "function" part of the payload, but then both failing and succeeding function definitions seem to work, so this didn't really shed any light on the issue. - Saw the suggestion mentioned by the author of https://github.com/huggingface/text-generation-inference/issues/2145 , re: dumping json with `ensure_ascii=False`, but this didn't help. Thank you very much in advance for your help!

huggingface / text-generation-inference