huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Message API, The OpenAI Compatible API Still Outputs Stop Sequence #1724

Closed ibndias closed 4 months ago

ibndias commented 4 months ago

System Info

Hi, I am working on a model that uses the ChatML format.

When the finish reason is eos_token, the output is correct

When I did not add a stop sequence, the eos_token finish reason was triggered and the output was correct, with no eos_token appended to the output.

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://10.125.121.102:12314/v1",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    # stop=["<|im_end|>\n", "</s>"],
    max_tokens=4096,
    stream=False
)

print(chat_completion)

Output:

ChatCompletion(id='', choices=[Choice(finish_reason='eos_token', index=0, logprobs=None, message=ChatCompletionMessage(content=' Deep learning is a subfield of machine learning, which itself is a subset of artificial intelligence (AI). It is a type of artificial neural network and computing system designed to recognize patterns and high-level abstractions in data through multiple layers of processing. \n\nIn deep learning, algorithms attempt to automatically improve their accuracy over time through learning and adapting to large amounts of data. These algorithms are designed to process complex data, such as speech, images, and video, and use the patterns they learn to make decisions or predictions.\n\nDeep learning techniques have shown remarkable success in a variety of applications such as computer vision, natural language processing, and game playing, and have become an essential tool in the field of artificial intelligence.<|im_end|>\n', role='assistant', function_call=None, tool_calls=None))], created=1712802591, model='PNU-Infosec/cipher-chiao-32k-v1.5', object='text_completion', system_fingerprint='1.4.5-sha-4ee0a0c', usage=CompletionUsage(completion_tokens=151, prompt_tokens=17, total_tokens=168))

When the finish reason is stop_sequence, the output is incorrect: the stop sequence is appended to the end of the output

But when I add an additional stop sequence, generation finishes correctly with the correct finish reason, yet the stop sequence token is still appended to the output.

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://10.125.121.102:12314/v1",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stop=["<|im_end|>\n", "</s>"],
    max_tokens=4096,
    stream=False
)

print(chat_completion)

Output: ChatCompletion(id='', choices=[Choice(finish_reason='stop_sequence', index=0, logprobs=None, message=ChatCompletionMessage(content=" Deep learning is a subset of machine learning that involves training artificial neural networks to recognize patterns and make decisions. It's based on algorithms that are designed to mimic the structure and function of the human brain, using layers of interconnected nodes to process vast amounts of data and learn from it.\n\nDeep learning has proven to be extremely effective in a wide range of applications, including image and speech recognition, natural language processing, and predictive analytics. This is due in part to the large amounts of data and computing power that are now available, as well as advances in the design of neural network architectures.\n\nOverall, deep learning represents a powerful tool for solving complex problems and making predictions, and its use is likely to continue to grow in the coming years.<|im_end|>\n", role='assistant', function_call=None, tool_calls=None))], created=1712802448, model='PNU-Infosec/cipher-chiao-32k-v1.5', object='text_completion', system_fingerprint='1.4.5-sha-4ee0a0c', usage=CompletionUsage(completion_tokens=157, prompt_tokens=17, total_tokens=174))

Notice that the <|im_end|>\n stop sequence token is still included.
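Until this is fixed server-side, one possible client-side workaround (a minimal sketch; `strip_stop_sequences` is a hypothetical helper, not part of TGI or the OpenAI SDK) is to strip a single trailing stop sequence from the returned content:

```python
def strip_stop_sequences(text: str, stops: list[str]) -> str:
    """Remove one trailing stop sequence from generated text, if present."""
    for stop in stops:
        if text.endswith(stop):
            return text[: -len(stop)]
    return text

# Example with the ChatML stop sequences used above:
stops = ["<|im_end|>\n", "</s>"]
content = "Deep learning is a subfield of machine learning.<|im_end|>\n"
print(strip_stop_sequences(content, stops))
# -> "Deep learning is a subfield of machine learning."
```

This would be applied to `chat_completion.choices[0].message.content` after each call, but it only papers over the issue for clients you control.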

Below are the details of my TGI 1.4.5 Docker container:

root@b2834eb643e0:/usr/src# text-generation-launcher --env
2024-04-11T02:34:07.897186Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: 4ee0a0c4010b6e000f176977648aa1749339e8cb
Docker label: sha-4ee0a0c
nvidia-smi:
Thu Apr 11 02:34:07 2024       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA A100 80G...  On   | 00000000:01:00.0 Off |                    0 |
   | N/A   37C    P0    64W / 300W |  63342MiB / 81920MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+

   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   +-----------------------------------------------------------------------------+

Issue

How do I prevent TGI from emitting the stop sequence in the output? Can I solve this by modifying the TGI side or the model repository? My goal is to make the server compatible with every frontend that supports the OpenAI API.

Best Regards, Derry

Information

Tasks

Reproduction

Steps to Reproduce:

  1. Start the TGI Docker container with a model that has an additional stop sequence
  2. Add the stop sequence to the OpenAI API request
  3. Generation stops correctly, but the stop sequence still appears in the output.

Expected behavior

Generation should not include the stop sequence in the output, just as it does not when the finish reason is eos_token.

AnyISalIn commented 4 months ago

same issue

ibndias commented 4 months ago

In vLLM, the stop sequence token is not printed.

drbh commented 4 months ago

This PR fixes a bug where some eos_tokens were not being included: https://github.com/huggingface/text-generation-inference/pull/1808. It should resolve this issue.

@AnyISalIn and @ibndias, would you please try the latest changes and see if that resolves the bug?

Please feel free to reopen this issue if the problem persists. Thank you! 🙏

ibndias commented 4 months ago

Hi @drbh, thanks for the effort, but unfortunately the stop sequence is still shown in the output.

Here is the command and code to reproduce it with an open model.

docker run --gpus "device=0" --rm -e HUGGING_FACE_HUB_TOKEN=token --shm-size 1g -p 12314:80 --name tgi-openhermes -v /data2/derry/.cache/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:2.0.2 --model-id teknium/OpenHermes-2.5-Mistral-7B --cuda-memory-fraction 0.5
from openai import OpenAI
client = OpenAI(
    base_url="http://10.125.121.102:12314/v1",
    api_key="-")

completion = client.chat.completions.create(
  model="openhermes",
  temperature=0.5,
  messages=[
    {"role": "system", "content": "You are instruction follower."},
    {"role": "user", "content": "Say number 1 - 10"}
  ],
  stop=["7"],
  max_tokens=100
)

print(completion.choices[0].message.content)

Result:

1, 2, 3, 4, 5, 6, 7

The stop sequence 7 is still shown.
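For reference, the behavior being asked for here (and what vLLM appears to do) is to truncate the generation at the first occurrence of any stop string rather than include it. A minimal sketch of that truncation logic (a hypothetical helper, not actual TGI code):

```python
def truncate_at_stop(text: str, stops: list[str]) -> str:
    """Cut generated text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# The reproduction above, post-processed: "7" and everything after it is dropped.
print(truncate_at_stop("1, 2, 3, 4, 5, 6, 7", ["7"]))
# -> "1, 2, 3, 4, 5, 6, "
```

Applying this to `completion.choices[0].message.content` works around the issue client-side, but the fix really belongs in the server so that all OpenAI-compatible frontends behave consistently.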

It seems there is no button to reopen the issue. Should I open a new one? @drbh