huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

One of two concurrent request generating empty text (Mistral 7B) #1776

Closed TysonHeart closed 5 months ago

TysonHeart commented 6 months ago

System Info

Running the TGI Docker image with the following command:

docker run --rm --gpus all --ipc=host -p 8080:80 -v /root/.cache/huggingface/hub:/data -e HF_API_TOKEN=hf_XXXX ghcr.io/huggingface/text-generation-inference:latest --hostname 0.0.0.0 --model-id mistralai/Mistral-7B-Instruct-v0.2 --num-shard 2 --max-input-length 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 5120 --max-batch-size 4

2024-04-19T16:33:32.585587Z INFO text_generation_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.2", revision: None, validation_workers: 2, sharded: None, num_shard: Some(2), quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(4096), max_total_tokens: Some(8192), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(5120), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some(4), cuda_graphs: None, hostname: "0.0.0.0", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }

From inside the container:

root@c6899067128f:/usr/src# text-generation-launcher --env
2024-04-19T16:44:08.530832Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: 2d0a7173d4891e7cd5f9b77f8e0987b82a339e51
Docker label: sha-2d0a717
nvidia-smi: Fri Apr 19 16:44:08 2024
  NVIDIA-SMI 525.147.05    Driver Version: 525.147.05    CUDA Version: 12.1
  GPU 0: NVIDIA A10 (00000000:00:04.0) | 41C | 58W / 150W | 20586MiB / 23028MiB | 0% util
  GPU 1: NVIDIA A10 (00000000:00:06.0) | 42C | 58W / 150W | 20586MiB / 23028MiB | 0% util
  Processes: none
2024-04-19T16:44:08.530872Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "c6899067128f", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4 }


Reproduction

Launch the Docker image as described above. Vanilla stuff.

Open two terminal windows and fire the curl commands below by hand, so (almost) concurrently.

Terminal 1:

curl http://my_gpu_machine:8080/generate \
    --header "Content-Type: application/json" \
    --data '{ "inputs": "Building a website can be done in 10 simple steps", "parameters": { "temperature": 0.8, "repetition_penalty": 1.0, "seed": 6198214712631710503, "max_new_tokens": 100, "frequency_penalty": 1.0}}'

Output: {"generated_text":":\nStep One - Choosing Your Website Platform and Domain Name. The first step is to choose your web hosting platform, such as WordPress or Wix/Squarespace etc., along with registering for the domain name that best suits you (websitename..com). Both of these tasks are usually offered together by companies like GoDaddy when building on their respective platforms so it's easy enough! Make sure this combination fits well before moving forward though because changing domains later"}%

Terminal 2:

curl http://my_gpu_machine:8080/generate \
    --header "Content-Type: application/json" \
    --data '{ "inputs": "What is the answer to life and all?", "parameters": { "temperature": 0.8, "repetition_penalty": 1.0, "seed": 6198214714631710518, "max_new_tokens": 100, "frequency_penalty": 1.0}}'

Output: {"generated_text":"\n"}%

This happens every time and is very easily reproducible.
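
For reference, the two requests can also be fired at (nearly) the same time from a single script instead of two terminals. This is only a sketch using the Python requests library, assuming the same host/port as the curl commands above; the payloads are copied verbatim from them.

from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://my_gpu_machine:8080/generate"  # same endpoint as the curl commands

payloads = [
    {"inputs": "Building a website can be done in 10 simple steps",
     "parameters": {"temperature": 0.8, "repetition_penalty": 1.0,
                    "seed": 6198214712631710503, "max_new_tokens": 100,
                    "frequency_penalty": 1.0}},
    {"inputs": "What is the answer to life and all?",
     "parameters": {"temperature": 0.8, "repetition_penalty": 1.0,
                    "seed": 6198214714631710518, "max_new_tokens": 100,
                    "frequency_penalty": 1.0}},
]

def generate(payload):
    # Send one /generate request and return its generated_text field
    response = requests.post(URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json().get("generated_text")

# Submit both requests concurrently
with ThreadPoolExecutor(max_workers=2) as pool:
    for text in pool.map(generate, payloads):
        print(repr(text))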

Expected behavior

As you can see, the first API call returns generated text whereas the second one just comes back with "\n" (or sometimes a period "." or a colon ":"). I expect both API calls to return generated text.

This seems to be very basic functionality. Am I missing something obvious here?

alexgravx commented 6 months ago

I have a similar issue with the /generate API endpoint and the Llama 2 model (meta-llama/Llama-2-7b-chat-hf). I am making asynchronous requests in Python using asyncio and aiohttp.

Here is my code (you'll have to set env variables).

import asyncio
import os

import aiohttp
from dotenv import load_dotenv

# Load env variables once (SERVER_IP must point at the TGI server, e.g. "http://host:8080")
load_dotenv()

async def post(string, session, temperature=0.7, max_new_tokens=50):
    # Set url
    url = os.getenv("SERVER_IP") + "/generate"
    # Set headers
    headers = {
        "Content-Type": "application/json",
    }
    # Set data
    data = {
        "inputs": string,
        "parameters": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
    }

    # Asynchronous request
    async with session.post(url=url, headers=headers, json=data) as response:
        resp = await response.json()
        return resp.get('generated_text')

async def main(String_List):
    # One client session shared by all concurrent requests
    async with aiohttp.ClientSession() as session:
        responses = await asyncio.gather(*(post(string, session) for string in String_List))
    return responses

String_List = ["Why is the sky blue ?", "Does magic exist?"]  # example prompts
responses = asyncio.run(main(String_List))

This issue seems to happen when the server doesn't get the requests at the exact same time.
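
To test that hypothesis, the requests can be staggered on purpose with a per-request delay before each POST. This is just a sketch reusing the post() coroutine above; the gap_s delay is an arbitrary illustration, not a TGI parameter.

import asyncio

import aiohttp

async def delayed_post(string, session, delay_s):
    # Hold the request back so it reaches the server later than the others
    await asyncio.sleep(delay_s)
    return await post(string, session)

async def staggered_main(String_List, gap_s=2.0):
    async with aiohttp.ClientSession() as session:
        # Request i is delayed by i * gap_s seconds
        tasks = (delayed_post(s, session, i * gap_s) for i, s in enumerate(String_List))
        return await asyncio.gather(*tasks)

# Example: responses = asyncio.run(staggered_main(["Why is the sky blue ?", "Does magic exist?"]))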

I don't have the issue with 2 simple simultaneous requests, with String_List = ["Why is the sky blue ?", "Does magic exist?"]. Here is the result:

["\n\nThe sky appears blue because of a phenomenon called Rayleigh scattering, which occurs when sunlight travels through the Earth's atmosphere. Blue light, which has a shorter wavelength, is scattered more than other colors,", " I don't know, but I do know that it's a powerful force that has captured the imagination of people for centuries. If you believe in magic, then you know that it's not just a trick or a illusion, but"]

Here are the server logs. We can see that the 2 requests arrived at the exact same time:

2024-05-02T18:46:14.841112Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.03429266s" validation_time="119.75µs" queue_time="24.621µs" inference_time="3.034148508s" time_per_token="60.68297ms" seed="Some(12312515242247638074)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T18:46:14.841141Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.013721041s" validation_time="113.869µs" queue_time="41.184335ms" inference_time="2.972423037s" time_per_token="59.44846ms" seed="Some(2428907862398705128)"}: text_generation_router::server: router/src/server.rs:322: Success

However, I am making more complex requests with a RAG, and there I get the same issue as you. Here is the response; the first one is empty:

['', '\n\n Pour répondre à cela, nous allons procéder à une analyse des différentesapproches et formulations couramment utilisées pour le dimensionnement et l’évaluation d’architecturesavion. Nous all']

What is interesting is the logs. As you can see, the requests don't arrive at the same time (they are a few seconds apart). Because my RAG is complex and the input prompt is bigger, I think that may be causing the delay. You can also see that the time per token is higher for the empty return (436 ms vs 77 ms, which is the standard time for this model):

2024-05-02T18:45:09.400725Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="900.637516ms" validation_time="906.888µs" queue_time="463.669762ms" inference_time="436.061055ms" time_per_token="436.061055ms" seed="Some(6510953659954175863)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T18:45:12.363474Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.865515768s" validation_time="934.667µs" queue_time="31.45µs" inference_time="3.864549831s" time_per_token="77.290996ms" seed="Some(4827920659267836322)"}: text_generation_router::server: router/src/server.rs:322: Success

When I make 6 simultaneous requests, I have the same issue: the first response is empty with a higher time per token (about 500 ms), while the 5 others are standard with about 70 ms per token.
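
To see what the server actually produced for those empty responses, it may help to ask for token-level details; the details flag is visible in the GenerateParameters logged above. A minimal sketch of the change inside the post() coroutine, assuming the /generate response then carries a details object as described in the TGI docs:

    data = {
        "inputs": string,
        "parameters": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "details": True,  # ask the server to return finish_reason and per-token info
        },
    }

    async with session.post(url=url, headers=headers, json=data) as response:
        resp = await response.json()
        if not resp.get("generated_text"):
            # Inspect why the generation stopped (finish_reason, generated_tokens, tokens)
            print(resp.get("details"))
        return resp.get("generated_text")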

I then changed the model to Mistral (mistralai/Mistral-7B-Instruct-v0.2) and the empty first string disappeared; I can't tell why... However, the requests arrived at the same time, so we can link the issue to the arrival time of the requests.

Logs with Mistral:

2024-05-02T19:45:56.665190Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="6.579128639s" validation_time="2.683284ms" queue_time="23.591µs" inference_time="6.576421954s" time_per_token="131.528439ms" seed="Some(8042501124986571198)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T19:45:56.665218Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="6.575659955s" validation_time="827.079µs" queue_time="474.720856ms" inference_time="6.10011227s" time_per_token="122.002245ms" seed="Some(18189346514334242789)"}: text_generation_router::server: router/src/server.rs:322: Success
github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.