huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

TGI returns empty responses on a random basis with a specific model #1190

Closed: JoFrost closed this issue 10 months ago

JoFrost commented 1 year ago

System Info

Command: text-generation-launcher --model-id Phind/Phind-CodeLlama-34B-v2 --max-input-length 3072 --max-total-tokens 6144 --port 8080
Target: x86_64-unknown-linux-gnu (Ubuntu 20.04)
Cargo version: 1.70.0
Commit sha: 5e28f44a834c20602d4cc18d28703e024d3bbbe0
Docker label: N/A
nvidia-smi:
Mon Oct 23 12:21:19 2023       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA A100 80G...  Off  | 00000001:00:00.0 Off |                    0 |
   | N/A   37C    P0    63W / 300W |  68640MiB / 81920MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+

   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |    0   N/A  N/A      1167      G   /usr/lib/xorg/Xorg                130MiB |
   |    0   N/A  N/A      1541      G   /usr/bin/gnome-shell               12MiB |
   |    0   N/A  N/A    155336      C   ...ion-inference3/bin/python    68492MiB |
   +-----------------------------------------------------------------------------+

model_id: "Phind/Phind-CodeLlama-34B-v2",
model_sha: "949f61e203f91b412efe8f679c798f09f0ff4b0c",
model_dtype: "torch.float16",
model_device_type: "cuda",
model_pipeline_tag: "text-generation",
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_input_length: 3072,
max_total_tokens: 6144,
waiting_served_ratio: 1.2,
max_batch_total_tokens: 16000,
max_waiting_tokens: 20,
validation_workers: 2,
version: "1.1.1",
sha: "5e28f44a834c20602d4cc18d28703e024d3bbbe0",
docker_label" null

Reproduction

Send the context below to the generate_stream endpoint, using the following parameters (a sketch of the full request follows the prompt):

max_new_tokens: 3072
temperature: 0.8
truncate: 3072
return_full_text: false
I want you to act as Senior full stack developer. You have long experience working with colleagues, and know perfectly well how to handle commits from colleagues.

You are provided a commit request and have to explain what the proposed modifications do.
Here is the original code:
"""
sentry.queue.client
~~~~~~~~~~~~~~~~~~~

:copyright: (c) 2010 by the Sentry Team, see AUTHORS for more details.
:license: BSD, see LICENSE for more details.
"""
from kombu import BrokerConnection
from kombu.common import maybe_declare
from kombu.pools import producers

from sentry.conf import settings
from sentry.queue.queues import task_queues, task_exchange

class Broker(object):
    def __init__(self, config):
        self.connection = BrokerConnection(**config)

    def delay(self, func, *args, **kwargs):
        payload = {
            "func": func,
            "args": args,
            "kwargs": kwargs,
        }

        with producers[self.connection].acquire(block=False) as producer:
            for queue in task_queues:
                maybe_declare(queue, producer.channel)
            producer.publish(payload,
                exchange=task_exchange,
                serializer="pickle",
                compression="bzip2",
                queue='default',
                routing_key='default',
            )

broker = Broker(settings.QUEUE)

Here is the proposed code:
"""
sentry.queue.client
~~~~~~~~~~~~~~~~~~~

:copyright: (c) 2010 by the Sentry Team, see AUTHORS for more details.
:license: BSD, see LICENSE for more details.
"""
from kombu import BrokerConnection
from kombu.common import maybe_declare
from kombu.pools import producers

from sentry.conf import settings
from sentry.queue.queues import task_queues, task_exchange

class Broker(object):
    def __init__(self, config):
        self.connection = BrokerConnection(**config)
        with producers[self.connection].acquire(block=False) as producer:
            for queue in task_queues:
                maybe_declare(queue, producer.channel)

    def delay(self, func, *args, **kwargs):
        payload = {
            "func": func,
            "args": args,
            "kwargs": kwargs,
        }

        with producers[self.connection].acquire(block=False) as producer:
            producer.publish(payload,
                exchange=task_exchange,
                serializer="pickle",
                compression="bzip2",
                queue='default',
                routing_key='default',
            )

broker = Broker(settings.QUEUE)

Here is the commit message:
Declare queues when broker is instantiated
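
A minimal sketch of the reproduction request, assuming the server from the launch command above is listening on localhost:8080 and that the full context above is stored in a prompt variable:

    import json
    import requests

    prompt = "I want you to act as Senior full stack developer. ..."  # full context shown above

    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 3072,
            "temperature": 0.8,
            "truncate": 3072,
            "return_full_text": False,
        },
    }

    # generate_stream replies with server-sent events: intermediate events carry
    # one token each, and the final event carries the full generated_text.
    with requests.post("http://localhost:8080/generate_stream", json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if line.startswith(b"data:"):
                event = json.loads(line[len(b"data:"):])
                if event.get("generated_text") is not None:
                    print(repr(event["generated_text"]))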

Sometimes the response is the one expected, and sometimes it is empty. Here is what shows up on the server side when the response is empty:

2023-10-23T12:09:26.059633Z  INFO generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.8), repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(3072), return_full_text: Some(false), stop: [], truncate: Some(3072), watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None } total_time="316.239404ms" validation_time="98.524µs" queue_time="88.316µs" inference_time="316.052744ms" time_per_token="316.052744ms" seed="Some(14999157372453187274)"}: text_generation_router::server: router/src/server.rs:457: Success

The inference time is abnormally low compared to a successful inference:

2023-10-23T12:09:21.008555Z  INFO generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.8), repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(3072), return_full_text: Some(false), stop: [], truncate: Some(3072), watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None } total_time="4.823720586s" validation_time="129.241µs" queue_time="81.368µs" inference_time="4.823510278s" time_per_token="61.057091ms" seed="Some(6198214712631710503)"}: text_generation_router::server: router/src/server.rs:457: Success

Expected behavior

The server should return an answer every time, not randomly send empty answers.

OlivierDehaene commented 1 year ago

It seems that you are using temperature = 0.8, i.e. a sampling strategy. Can you make sure that, for a given seed that works, the model always outputs an answer, and the other way around for a seed that does not? If that is the case, TGI is working properly and it is only the sampling strategy that leads to no answer.
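
A sketch of that check, assuming the non-streaming /generate route and reusing the seed reported in the "empty" log line above; if emptiness follows the seed, it is the sampling strategy, and if it varies even with the seed pinned, something else is going on:

    import requests

    prompt = "I want you to act as Senior full stack developer. ..."  # same context as in the reproduction

    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 3072,
            "temperature": 0.8,
            "truncate": 3072,
            "return_full_text": False,
            "do_sample": True,
            # Seed reported by the router for one of the empty responses.
            "seed": 14999157372453187274,
        },
    }

    # With a pinned seed the output should be identical across repeats.
    for _ in range(3):
        out = requests.post("http://localhost:8080/generate", json=payload).json()
        print(repr(out["generated_text"][:80]))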

jrsperry commented 11 months ago

I'm getting the same behavior, although I've run it with two different models. I've tried specifying a variety of temperatures, as well as leaving it unspecified. I'm using version 1.1.0 of TGI and using the TGI client to make requests. If I provide input of only a single sentence, I tend to reliably get results, but this is bizarre.

Models tested: mistralai/Mistral-7B-Instruct-v0.1, togethercomputer/Llama-2-7B-32K-Instruct

randxie commented 11 months ago

Probably not directly related, but I recently experienced TGI returning an empty generated_text while the "tokens" field inside the response object was not empty. For those seeing an empty "generated_text", consider checking whether you can find anything in the "tokens" field.
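
A sketch of that check, assuming the non-streaming /generate route with token-level details enabled (parameter and field names follow the TGI API; the short prompt is just a placeholder):

    import requests

    payload = {
        "inputs": "def hello_world():",
        "parameters": {"max_new_tokens": 32, "details": True},
    }
    out = requests.post("http://localhost:8080/generate", json=payload).json()

    # Compare the concatenated token texts with the top-level generated_text;
    # an empty generated_text alongside non-empty tokens matches the symptom.
    tokens = out.get("details", {}).get("tokens", [])
    print("generated_text:", repr(out["generated_text"]))
    print("tokens:", repr("".join(t["text"] for t in tokens)))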

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.