huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

One of two concurrent request generating empty text (Mistral 7B) #1776

Closed TysonHeart closed 5 months ago

TysonHeart commented 6 months ago

System Info

Running the TGI Docker image with the following command:

docker run --rm --gpus all --ipc=host -p 8080:80 -v /root/.cache/huggingface/hub:/data -e HF_API_TOKEN=hf_XXXX ghcr.io/huggingface/text-generation-inference:latest --hostname 0.0.0.0 --model-id mistralai/Mistral-7B-Instruct-v0.2 --num-shard 2 --max-input-length 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 5120 --max-batch-size 4

2024-04-19T16:33:32.585587Z INFO text_generation_launcher: Args { model_id: "mistralai/Mistral-7B-Instruct-v0.2", revision: None, validation_workers: 2, sharded: None, num_shard: Some(2), quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(4096), max_total_tokens: Some(8192), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(5120), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: Some(4), cuda_graphs: None, hostname: "0.0.0.0", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }

From inside the container:

root@c6899067128f:/usr/src# text-generation-launcher --env
2024-04-19T16:44:08.530832Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: 2d0a7173d4891e7cd5f9b77f8e0987b82a339e51
Docker label: sha-2d0a717
nvidia-smi: Fri Apr 19 16:44:08 2024
  NVIDIA-SMI 525.147.05    Driver Version: 525.147.05    CUDA Version: 12.1
  GPU 0: NVIDIA A10 (00000000:00:04.0) | 41C | 58W / 150W | 20586MiB / 23028MiB | 0% util
  GPU 1: NVIDIA A10 (00000000:00:06.0) | 42C | 58W / 150W | 20586MiB / 23028MiB | 0% util
  Processes: none
2024-04-19T16:44:08.530872Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "c6899067128f", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4 }


Reproduction

Launch the Docker image as described above. Vanilla stuff.

Open two terminal windows and fire the curl commands below by hand, so (almost) concurrently.

Terminal 1:

curl http://my_gpu_machine:8080/generate \
    --header "Content-Type: application/json" \
    --data '{ "inputs": "Building a website can be done in 10 simple steps", "parameters": { "temperature": 0.8, "repetition_penalty": 1.0, "seed": 6198214712631710503, "max_new_tokens": 100, "frequency_penalty": 1.0}}'

Output: {"generated_text":":\nStep One - Choosing Your Website Platform and Domain Name. The first step is to choose your web hosting platform, such as WordPress or Wix/Squarespace etc., along with registering for the domain name that best suits you (websitename..com). Both of these tasks are usually offered together by companies like GoDaddy when building on their respective platforms so it's easy enough! Make sure this combination fits well before moving forward though because changing domains later"}%

Terminal 2:

curl http://my_gpu_machine:8080/generate \
    --header "Content-Type: application/json" \
    --data '{ "inputs": "What is the answer to life and all?", "parameters": { "temperature": 0.8, "repetition_penalty": 1.0, "seed": 6198214714631710518, "max_new_tokens": 100, "frequency_penalty": 1.0}}'

Output: {"generated_text":"\n"}%

This happens every time and is very easily reproducible.
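
For reference, the two requests can also be fired at (nearly) the same time from a single script instead of two terminals. This is only a sketch using the Python requests library, assuming the same host/port as the curl commands above; the payloads are copied verbatim from them.

from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://my_gpu_machine:8080/generate"  # same endpoint as the curl commands

payloads = [
    {"inputs": "Building a website can be done in 10 simple steps",
     "parameters": {"temperature": 0.8, "repetition_penalty": 1.0,
                    "seed": 6198214712631710503, "max_new_tokens": 100,
                    "frequency_penalty": 1.0}},
    {"inputs": "What is the answer to life and all?",
     "parameters": {"temperature": 0.8, "repetition_penalty": 1.0,
                    "seed": 6198214714631710518, "max_new_tokens": 100,
                    "frequency_penalty": 1.0}},
]

def generate(payload):
    # Send one /generate request and return its generated_text field
    response = requests.post(URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json().get("generated_text")

# Submit both requests concurrently
with ThreadPoolExecutor(max_workers=2) as pool:
    for text in pool.map(generate, payloads):
        print(repr(text))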

Expected behavior

As you can see, the first API call returns generated text whereas the second one just comes back with "\n" (or sometimes a period "." or a colon ":"). I expect both API calls to return generated text.

This seems to be very basic functionality. Am I missing something obvious here?

alexgravx commented 6 months ago

I have a similar issue with the /generate API endpoint and the Llama 2 model (meta-llama/Llama-2-7b-chat-hf). I am making asynchronous requests in Python using asyncio and aiohttp.

Here is my code (you'll have to set env variables).

import asyncio
import os

import aiohttp
from dotenv import load_dotenv

# Load env variables once (SERVER_IP must point at the TGI server, e.g. "http://host:8080")
load_dotenv()

async def post(string, session, temperature=0.7, max_new_tokens=50):
    # Set url
    url = os.getenv("SERVER_IP") + "/generate"
    # Set headers
    headers = {
        "Content-Type": "application/json",
    }
    # Set data
    data = {
        "inputs": string,
        "parameters": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
    }

    # Asynchronous request
    async with session.post(url=url, headers=headers, json=data) as response:
        resp = await response.json()
        return resp.get('generated_text')

async def main(String_List):
    # One client session shared by all concurrent requests
    async with aiohttp.ClientSession() as session:
        responses = await asyncio.gather(*(post(string, session) for string in String_List))
    return responses

String_List = ["Why is the sky blue ?", "Does magic exist?"]  # example prompts
responses = asyncio.run(main(String_List))

This issue seems to happen when the server doesn't get the requests at the exact same time.
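
To test that hypothesis, the requests can be staggered on purpose with a per-request delay before each POST. This is just a sketch reusing the post() coroutine above; the gap_s delay is an arbitrary illustration, not a TGI parameter.

import asyncio

import aiohttp

async def delayed_post(string, session, delay_s):
    # Hold the request back so it reaches the server later than the others
    await asyncio.sleep(delay_s)
    return await post(string, session)

async def staggered_main(String_List, gap_s=2.0):
    async with aiohttp.ClientSession() as session:
        # Request i is delayed by i * gap_s seconds
        tasks = (delayed_post(s, session, i * gap_s) for i, s in enumerate(String_List))
        return await asyncio.gather(*tasks)

# Example: responses = asyncio.run(staggered_main(["Why is the sky blue ?", "Does magic exist?"]))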

I don't have the issue with 2 simple simultaneous requests, with String_List = ["Why is the sky blue ?", "Does magic exist?"]. Here is the result:

["\n\nThe sky appears blue because of a phenomenon called Rayleigh scattering, which occurs when sunlight travels through the Earth's atmosphere. Blue light, which has a shorter wavelength, is scattered more than other colors,", " I don't know, but I do know that it's a powerful force that has captured the imagination of people for centuries. If you believe in magic, then you know that it's not just a trick or a illusion, but"]

Here are the server logs. We can see that the 2 requests arrived at the exact same time:

2024-05-02T18:46:14.841112Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.03429266s" validation_time="119.75µs" queue_time="24.621µs" inference_time="3.034148508s" time_per_token="60.68297ms" seed="Some(12312515242247638074)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T18:46:14.841141Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.013721041s" validation_time="113.869µs" queue_time="41.184335ms" inference_time="2.972423037s" time_per_token="59.44846ms" seed="Some(2428907862398705128)"}: text_generation_router::server: router/src/server.rs:322: Success

However, I am making more complex requests with a RAG, and there I get the same issue as you. Here is the response; the first one is empty:

['', '\n\n Pour répondre à cela, nous allons procéder à une analyse des différentesapproches et formulations couramment utilisées pour le dimensionnement et l’évaluation d’architecturesavion. Nous all']

What is interesting is the logs. As you can see, the requests don't arrive at the same time (they are a few seconds apart). Because my RAG is complex and the input prompt is bigger, I think that may be causing the delay. You can also see that the time per token is higher for the empty return (436 ms vs 77 ms, which is the standard time for this model):

2024-05-02T18:45:09.400725Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="900.637516ms" validation_time="906.888µs" queue_time="463.669762ms" inference_time="436.061055ms" time_per_token="436.061055ms" seed="Some(6510953659954175863)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T18:45:12.363474Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="3.865515768s" validation_time="934.667µs" queue_time="31.45µs" inference_time="3.864549831s" time_per_token="77.290996ms" seed="Some(4827920659267836322)"}: text_generation_router::server: router/src/server.rs:322: Success

When I make 6 simultaneous requests, I have the same issue: the first response is empty with a higher time per token (about 500 ms), while the 5 others are standard with about 70 ms per token.
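
To see what the server actually produced for those empty responses, it may help to ask for token-level details; the details flag is visible in the GenerateParameters logged above. A minimal sketch of the change inside the post() coroutine, assuming the /generate response then carries a details object as described in the TGI docs:

    data = {
        "inputs": string,
        "parameters": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "details": True,  # ask the server to return finish_reason and per-token info
        },
    }

    async with session.post(url=url, headers=headers, json=data) as response:
        resp = await response.json()
        if not resp.get("generated_text"):
            # Inspect why the generation stopped (finish_reason, generated_tokens, tokens)
            print(resp.get("details"))
        return resp.get("generated_text")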

I then changed the model to Mistral (mistralai/Mistral-7B-Instruct-v0.2) and the empty first string disappeared; I can't tell why... However, the requests arrived at the same time, so we can link the issue to the arrival time of the requests.

Logs with Mistral:

2024-05-02T19:45:56.665190Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="6.579128639s" validation_time="2.683284ms" queue_time="23.591µs" inference_time="6.576421954s" time_per_token="131.528439ms" seed="Some(8042501124986571198)"}: text_generation_router::server: router/src/server.rs:322: Success
2024-05-02T19:45:56.665218Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(50), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="6.575659955s" validation_time="827.079µs" queue_time="474.720856ms" inference_time="6.10011227s" time_per_token="122.002245ms" seed="Some(18189346514334242789)"}: text_generation_router::server: router/src/server.rs:322: Success
github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.