huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

TGI unresponsive at `max-input-length` settings #1003

Closed: amihalik closed this issue 4 months ago

amihalik commented 10 months ago

### System Info

The full command line that causes the issue:

docker run -it --rm -p 8080:80 --gpus all --name tgi \
  -v /dev/shm/models:/models --shm-size 2g -e CUDA_LAUNCH_BLOCKING=1 \
  ghcr.io/huggingface/text-generation-inference:1.0.3 \
  --model-id /models/lmsys/vicuna-7b-v1.5-16k/ \
  --num-shard 4 \
  --rope-scaling=linear --rope-factor=4.0 \
  --max-input-length=15000 \
  --max-batch-prefill-tokens=15000 \
  --max-total-tokens=16000

Model being used: lmsys/vicuna-7b-v1.5-16k/

OS version: Deep Learning AMI GPU PyTorch 2.0.1 (Amazon Linux 2) 20230627 ami-051619310404cab17

Hardware used: AWS g5.12xlarge. 4xNVIDIA A10G

The current version being used: 1.0.3 ("docker_label": "sha-5485c14")

### Information

### Tasks

### Reproduction

  1. Launch the docker container using the docker run command above.
  2. Wait until the container launch completes, e.g.:
    
    2023-09-10T18:20:48.267361Z  INFO text_generation_launcher: Args { model_id: "/models/lmsys/vicuna-7b-v1.5-16k/", revision: None, validation_workers: 2, sharded: None, num_shard: Some(4), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 15000, max_total_tokens: 16000, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 15000, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "834dd5b47195", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: Some(Linear), rope_factor: Some(4.0), json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
    2023-09-10T18:20:48.267403Z  INFO text_generation_launcher: Sharding model on 4 processes
    2023-09-10T18:20:48.267478Z  INFO download: text_generation_launcher: Starting download process.
    2023-09-10T18:20:50.815981Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

    2023-09-10T18:20:51.270263Z INFO download: text_generation_launcher: Successfully downloaded weights.
    2023-09-10T18:20:51.270469Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
    2023-09-10T18:20:51.270961Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
    2023-09-10T18:20:51.270553Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
    2023-09-10T18:20:51.270977Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
    2023-09-10T18:20:58.393577Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
    2023-09-10T18:20:58.399344Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
    2023-09-10T18:20:58.407210Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
    2023-09-10T18:20:58.422751Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
    2023-09-10T18:20:58.477899Z INFO shard-manager: text_generation_launcher: Shard ready in 7.2063218s rank=0
    2023-09-10T18:20:58.478280Z INFO shard-manager: text_generation_launcher: Shard ready in 7.205919795s rank=3
    2023-09-10T18:20:58.478531Z INFO shard-manager: text_generation_launcher: Shard ready in 7.206058467s rank=2
    2023-09-10T18:20:58.478684Z INFO shard-manager: text_generation_launcher: Shard ready in 7.206132838s rank=1
    2023-09-10T18:20:58.576818Z INFO text_generation_launcher: Starting Webserver
    2023-09-10T18:20:58.583979Z WARN text_generation_router: router/src/main.rs:166: Could not find a fast tokenizer implementation for /models/lmsys/vicuna-7b-v1.5-16k/
    2023-09-10T18:20:58.584007Z WARN text_generation_router: router/src/main.rs:169: Rust input length validation and truncation is disabled
    2023-09-10T18:20:58.584010Z WARN text_generation_router: router/src/main.rs:194: no pipeline tag found for model /models/lmsys/vicuna-7b-v1.5-16k/
    2023-09-10T18:20:58.588523Z INFO text_generation_router: router/src/main.rs:213: Warming up model
    2023-09-10T18:21:04.166192Z INFO text_generation_router: router/src/main.rs:246: Setting max batch total tokens to 109200
    2023-09-10T18:21:04.166222Z INFO text_generation_router: router/src/main.rs:247: Connected
    2023-09-10T18:21:04.166228Z WARN text_generation_router: router/src/main.rs:252: Invalid hostname, defaulting to 0.0.0.0

  3. Verify the server is responsive: `curl 'http://localhost:8080/info' | jq` returns:

    {
      "model_id": "/models/lmsys/vicuna-7b-v1.5-16k/",
      "model_sha": null,
      "model_dtype": "torch.float16",
      "model_device_type": "cuda",
      "model_pipeline_tag": null,
      "max_concurrent_requests": 128,
      "max_best_of": 2,
      "max_stop_sequences": 4,
      "max_input_length": 15000,
      "max_total_tokens": 16000,
      "waiting_served_ratio": 1.2,
      "max_batch_total_tokens": 109200,
      "max_waiting_tokens": 20,
      "validation_workers": 2,
      "version": "1.0.3",
      "sha": "5485c142e87a182f2eee713a6b056ee38bd901f5",
      "docker_label": "sha-5485c14"
    }


  4. Query the model: `curl -X 'POST'  'http://localhost:8080/generate'  -H 'Content-Type: application/json'  -d '{"inputs": "My name is Olivier and I"}' --max-time 10`
  5. Wait for curl to time out, e.g. `curl: (28) Operation timed out after 10001 milliseconds with 0 bytes received` (a scripted version of steps 3-5 follows this list)

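The unresponsive call can also be reproduced from a short script. Below is a minimal sketch, assuming the container from step 1 is listening on localhost:8080 and the Python `requests` package is installed; the 10-second timeout mirrors curl's `--max-time 10`.

```python
# Minimal reproduction sketch (assumption: the container from step 1 is up on localhost:8080).
import requests

BASE = "http://localhost:8080"

# Step 3: /info answers immediately even while /generate hangs.
info = requests.get(f"{BASE}/info", timeout=5).json()
print("model_id:", info["model_id"])
print("max_input_length:", info["max_input_length"])

# Steps 4-5: with the 15000/15000 settings, /generate never returns and the timeout fires.
try:
    r = requests.post(
        f"{BASE}/generate",
        json={"inputs": "My name is Olivier and I"},
        timeout=10,  # same budget as curl --max-time 10
    )
    print("generate:", r.json())
except requests.exceptions.Timeout:
    print("generate: timed out after 10 s")
```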
### Expected behavior

Generated text response, e.g. `{"generated_text":"am a French photographer based in Paris. nobody is perfect, but I try to be the best"}`
amihalik commented 10 months ago

Note that other settings work fine; compare the `max-input-length` and `max-batch-prefill-tokens` values in the commands below:

Works:

docker run -it --rm -p 8080:80 --gpus all --name tgi \
  -v /dev/shm/models:/models --shm-size 2g -e CUDA_LAUNCH_BLOCKING=1 \
  ghcr.io/huggingface/text-generation-inference:1.0.3 \
  --model-id /models/lmsys/vicuna-7b-v1.5-16k/ \
  --num-shard 4 \
  --rope-scaling=linear --rope-factor=4.0 \
  --max-input-length=14000 \
  --max-batch-prefill-tokens=14000 \
  --max-total-tokens=16000

Fails:

docker run -it --rm -p 8080:80 --gpus all --name tgi \
  -v /dev/shm/models:/models --shm-size 2g -e CUDA_LAUNCH_BLOCKING=1 \
  ghcr.io/huggingface/text-generation-inference:1.0.3 \
  --model-id /models/lmsys/vicuna-7b-v1.5-16k/ \
  --num-shard 4 \
  --rope-scaling=linear --rope-factor=4.0 \
  --max-input-length=15000 \
  --max-batch-prefill-tokens=15001 \
  --max-total-tokens=16000

Works:

docker run -it --rm -p 8080:80 --gpus all --name tgi \
  -v /dev/shm/models:/models --shm-size 2g -e CUDA_LAUNCH_BLOCKING=1 \
  ghcr.io/huggingface/text-generation-inference:1.0.3 \
  --model-id /models/lmsys/vicuna-7b-v1.5-16k/ \
  --num-shard 4 \
  --rope-scaling=linear --rope-factor=4.0 \
  --max-input-length=15000 \
  --max-batch-prefill-tokens=16000 \
  --max-total-tokens=16000
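
For quick comparison, the same launches can be driven from a script. The sketch below is purely illustrative: it assumes `docker` is on the PATH, runs the container detached (`-d` instead of `-it`), and uses the Python `requests` package; all other flags and paths are taken from the commands above. It starts each configuration, waits for `/info`, probes `/generate` with a 10-second timeout, and then removes the container.

```python
# Illustrative sweep over the (max-input-length, max-batch-prefill-tokens) pairs above.
# Assumptions: docker on PATH, detached run (-d) instead of -it, image and model paths
# exactly as in the commands above.
import subprocess
import time
import requests

BASE = "http://localhost:8080"
COMBOS = [(14000, 14000), (15000, 15000), (15000, 15001), (15000, 16000)]

def wait_ready(timeout_s: int = 600) -> bool:
    """Poll /info until the router answers or the deadline expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            requests.get(f"{BASE}/info", timeout=2)
            return True
        except requests.exceptions.RequestException:
            time.sleep(5)
    return False

for max_input, max_prefill in COMBOS:
    subprocess.run(
        [
            "docker", "run", "-d", "--rm", "-p", "8080:80", "--gpus", "all",
            "--name", "tgi", "-v", "/dev/shm/models:/models", "--shm-size", "2g",
            "-e", "CUDA_LAUNCH_BLOCKING=1",
            "ghcr.io/huggingface/text-generation-inference:1.0.3",
            "--model-id", "/models/lmsys/vicuna-7b-v1.5-16k/",
            "--num-shard", "4",
            "--rope-scaling=linear", "--rope-factor=4.0",
            f"--max-input-length={max_input}",
            f"--max-batch-prefill-tokens={max_prefill}",
            "--max-total-tokens=16000",
        ],
        check=True,
    )
    try:
        if not wait_ready():
            print(f"{max_input}/{max_prefill}: server never became ready")
            continue
        try:
            requests.post(
                f"{BASE}/generate",
                json={"inputs": "My name is Olivier and I"},
                timeout=10,
            )
            print(f"{max_input}/{max_prefill}: works")
        except requests.exceptions.Timeout:
            print(f"{max_input}/{max_prefill}: /generate timed out")
    finally:
        subprocess.run(["docker", "rm", "-f", "tgi"], check=False)
```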
Narsil commented 9 months ago

I was able to reproduce this. I couldn't easily find what triggers it, but it only occurs because this model doesn't have a fast tokenizer, meaning the Rust router cannot see the number of tokens per request, which messes up the scheduler.
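
If the missing fast tokenizer is indeed the trigger, one possible workaround (not verified in this thread) would be to write a `tokenizer.json` into the local model directory so the router can count tokens again. A minimal sketch with `transformers`, assuming the host-side path behind the `-v /dev/shm/models:/models` mount:

```python
# Workaround sketch: convert the slow (sentencepiece) tokenizer to a fast one and save
# tokenizer.json next to the model weights. Assumes transformers + tokenizers are installed.
from transformers import AutoTokenizer

MODEL_DIR = "/dev/shm/models/lmsys/vicuna-7b-v1.5-16k/"  # host path mounted into the container

# use_fast=True converts the slow tokenizer when no tokenizer.json is present.
tok = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=True)
print("is_fast:", tok.is_fast)

# save_pretrained writes tokenizer.json, which the router needs for input length validation.
tok.save_pretrained(MODEL_DIR)
```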

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.