Closed — amihalik closed this issue 4 months ago
Note that other settings work fine. The only differences between the runs below are the values of max-input-length and max-batch-prefill-tokens:
Works:
docker run -it --rm -p 8080:80 --gpus all --name tgi \
-v /dev/shm/models:/models --shm-size 2g -e CUDA_LAUNCH_BLOCKING=1 \
ghcr.io/huggingface/text-generation-inference:1.0.3 \
--model-id /models/lmsys/vicuna-7b-v1.5-16k/ \
--num-shard 4 \
--rope-scaling=linear --rope-factor=4.0 \
--max-input-length=14000 \
--max-batch-prefill-tokens=14000 \
--max-total-tokens=16000
Fails:
docker run -it --rm -p 8080:80 --gpus all --name tgi \
-v /dev/shm/models:/models --shm-size 2g -e CUDA_LAUNCH_BLOCKING=1 \
ghcr.io/huggingface/text-generation-inference:1.0.3 \
--model-id /models/lmsys/vicuna-7b-v1.5-16k/ \
--num-shard 4 \
--rope-scaling=linear --rope-factor=4.0 \
--max-input-length=15000 \
--max-batch-prefill-tokens=15001 \
--max-total-tokens=16000
Works:
docker run -it --rm -p 8080:80 --gpus all --name tgi \
-v /dev/shm/models:/models --shm-size 2g -e CUDA_LAUNCH_BLOCKING=1 \
ghcr.io/huggingface/text-generation-inference:1.0.3 \
--model-id /models/lmsys/vicuna-7b-v1.5-16k/ \
--num-shard 4 \
--rope-scaling=linear --rope-factor=4.0 \
--max-input-length=15000 \
--max-batch-prefill-tokens=16000 \
--max-total-tokens=16000
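Comparing the three runs: 14000/14000 works, 15000/16000 works, but 15000/15001 fails. Below is a small sketch that encodes the obvious consistency rules among these flags (check_limits is a hypothetical helper, not TGI's actual launcher validation). Notably, the failing configuration passes these checks too, which suggests a runtime scheduling problem rather than an invalid flag combination:

```python
def check_limits(max_input_length: int,
                 max_batch_prefill_tokens: int,
                 max_total_tokens: int) -> list[str]:
    """Return human-readable violations of the basic flag relationships
    (empty list if the configuration looks self-consistent)."""
    problems = []
    if max_input_length >= max_total_tokens:
        problems.append("max-input-length must be strictly below "
                        "max-total-tokens to leave room for generated tokens")
    if max_batch_prefill_tokens < max_input_length:
        problems.append("max-batch-prefill-tokens below max-input-length "
                        "means a single maximal prompt could never prefill")
    return problems

# The three configurations from this report:
print(check_limits(14000, 14000, 16000))  # the first "Works" config
print(check_limits(15000, 15001, 16000))  # the "Fails" config — passes here too
print(check_limits(15000, 16000, 16000))  # the second "Works" config
```

All three print an empty list, so whatever breaks the 15000/15001 run is not a simple ordering violation among the three limits.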
I was able to reproduce this. I couldn't easily pin down what triggers it, but it only occurs because this model doesn't have a fast tokenizer, meaning the Rust router cannot see the number of tokens per request, which messes up the scheduler.
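To illustrate the point above: without per-request token counts, a router has to budget prefill using a worst-case assumption. The sketch below is a hypothetical model of that degradation (can_prefill and can_prefill_blind are illustrative names, not TGI's real scheduler code):

```python
def can_prefill(batch_token_counts, max_batch_prefill_tokens):
    """Budget check when the router knows each prompt's real token count."""
    return sum(batch_token_counts) <= max_batch_prefill_tokens

def can_prefill_blind(num_requests, max_input_length, max_batch_prefill_tokens):
    """Same check when token counts are unknown: every prompt must be
    assumed to be max_input_length tokens long."""
    return num_requests * max_input_length <= max_batch_prefill_tokens

# Two short prompts (500 tokens each) easily fit a 15001-token prefill budget...
print(can_prefill([500, 500], 15001))
# ...but a blind router assuming 15000 tokens per request must refuse the batch.
print(can_prefill_blind(2, 15000, 15001))
```

This is only a model of the failure mode, but it shows how the scheduler's decisions can flip from correct to wrong the moment real token counts are unavailable.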
System Info
The full command line used that causes issues:
Model being used:
lmsys/vicuna-7b-v1.5-16k/
OS version:
Deep Learning AMI GPU PyTorch 2.0.1 (Amazon Linux 2) 20230627 ami-051619310404cab17
Hardware used:
AWS g5.12xlarge (4x NVIDIA A10G)
The current version being used:
1.0.3 ("docker_label": "sha-5485c14")
Information
Tasks
Reproduction
Run the docker run command above. Log output:

2023-09-10T18:20:51.270263Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-09-10T18:20:51.270469Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-09-10T18:20:51.270961Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
2023-09-10T18:20:51.270553Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-09-10T18:20:51.270977Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
2023-09-10T18:20:58.393577Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2023-09-10T18:20:58.399344Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2023-09-10T18:20:58.407210Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2023-09-10T18:20:58.422751Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2023-09-10T18:20:58.477899Z INFO shard-manager: text_generation_launcher: Shard ready in 7.2063218s rank=0
2023-09-10T18:20:58.478280Z INFO shard-manager: text_generation_launcher: Shard ready in 7.205919795s rank=3
2023-09-10T18:20:58.478531Z INFO shard-manager: text_generation_launcher: Shard ready in 7.206058467s rank=2
2023-09-10T18:20:58.478684Z INFO shard-manager: text_generation_launcher: Shard ready in 7.206132838s rank=1
2023-09-10T18:20:58.576818Z INFO text_generation_launcher: Starting Webserver
2023-09-10T18:20:58.583979Z WARN text_generation_router: router/src/main.rs:166: Could not find a fast tokenizer implementation for /models/lmsys/vicuna-7b-v1.5-16k/
2023-09-10T18:20:58.584007Z WARN text_generation_router: router/src/main.rs:169: Rust input length validation and truncation is disabled
2023-09-10T18:20:58.584010Z WARN text_generation_router: router/src/main.rs:194: no pipeline tag found for model /models/lmsys/vicuna-7b-v1.5-16k/
2023-09-10T18:20:58.588523Z INFO text_generation_router: router/src/main.rs:213: Warming up model
2023-09-10T18:21:04.166192Z INFO text_generation_router: router/src/main.rs:246: Setting max batch total tokens to 109200
2023-09-10T18:21:04.166222Z INFO text_generation_router: router/src/main.rs:247: Connected
2023-09-10T18:21:04.166228Z WARN text_generation_router: router/src/main.rs:252: Invalid hostname, defaulting to 0.0.0.0
{
  "model_id": "/models/lmsys/vicuna-7b-v1.5-16k/",
  "model_sha": null,
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 15000,
  "max_total_tokens": 16000,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 109200,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "version": "1.0.3",
  "sha": "5485c142e87a182f2eee713a6b056ee38bd901f5",
  "docker_label": "sha-5485c14"
}
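The JSON above is the router's response from its info endpoint. As a quick sanity check, the advertised limits can be cross-checked programmatically; the snippet below embeds an abridged copy of the fields checked (this is an illustrative check, not part of TGI):

```python
import json

# Abridged copy of the /info response fields used below.
info = json.loads("""{
  "max_input_length": 15000,
  "max_total_tokens": 16000,
  "max_batch_total_tokens": 109200,
  "version": "1.0.3"
}""")

# Inputs must leave headroom for decoding, and the per-batch total
# budget must cover at least one maximal request.
assert info["max_input_length"] < info["max_total_tokens"]
assert info["max_batch_total_tokens"] >= info["max_total_tokens"]
print(f"TGI {info['version']}: advertised limits are internally consistent")
```

These checks pass, reinforcing that the failure comes from runtime scheduling (with token counts invisible to the router) rather than from inconsistent advertised limits.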