Closed — arnavsinghvi11 closed this issue 1 year ago
What is the speed of your disk where the model is stored? We load the 70B in 50 secs in our prod.
It actually connected after a few more attempts, but now I run into this issue with the same command as above:
```
RuntimeError: shape '[-1, 3, 16, 128]' is invalid for input of size 10485760
rank=1 2023-07-21T18:11:21.572989Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}:warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}: text_generation_client: router/client/src/lib.rs:33: Server error: shape '[-1, 3, 16, 128]' is invalid for input of size 10485760
thread 'main' panicked at 'Unable to warmup model: Generation("shape '[-1, 3, 16, 128]' is invalid for input of size 10485760")'
```
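As a side note, the arithmetic in the error is easy to check. A `view` with one inferred dimension (`-1`) only succeeds when the total element count is divisible by the product of the fixed dimensions, and here it is not — which usually means the weights were sharded for a different head count or shard count than the server expects. A minimal check (the interpretation of the dimensions as qkv × heads-per-shard × head_dim is my assumption, not from the log):

```python
# First report: tensor of 10485760 elements viewed as [-1, 3, 16, 128].
numel = 10_485_760
fixed_dims = 3 * 16 * 128  # assumed: qkv * heads_per_shard * head_dim = 6144
print(numel % fixed_dims)  # 4096 -> not divisible, so the reshape must fail

# Second report (8 shards): 5242880 elements viewed as [-1, 3, 8, 128].
print(5_242_880 % (3 * 8 * 128))  # 2048 -> same failure mode
```

Since the remainder is non-zero in both cases, PyTorch raises exactly the `RuntimeError: shape '...' is invalid for input of size ...` seen in the log.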
Are there any specific parameter constraints for running llama2-70b-hf on the server?
You need to use 0.9.3.
Great that works now! thanks!
@OlivierDehaene can you please describe why 0.9.3 works and latest does not?
@germanjke can you describe please your issue on latest?
@OlivierDehaene I have an issue on latest, and it works fine on 0.9.3. I want to know what the difference between them is, thanks~
@germanjke can you describe please your issue on latest? For example, a stack trace?
@OlivierDehaene
```
Not enough memory to handle 16000 total tokens with 2 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`
```
and with llama:
```
RuntimeError: shape '[-1, 3, 8, 128]' is invalid for input of size 5242880
```
but in 0.9.3 everything is ok
Try adding `--pull always` before the image name.
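For reference, a sketch of where the flag goes (the image tag, volume, and variables are taken from the command later in this thread; the exact values are assumptions):

```shell
# --pull always is a docker run option, so it must appear before the image
# name; everything after the image is passed to text-generation-inference.
# It forces Docker to re-check the registry instead of reusing a stale
# locally cached image for the same tag.
docker run --pull always --gpus all --shm-size 1g \
  -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:0.9.3 \
  --model-id $model --num-shard $num_shard
```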
System Info
OS Version: Distributor ID: Ubuntu Description: Ubuntu 20.04.3 LTS Release: 20.04 Codename: focal
8 A100 GPUs
Information
Tasks
Reproduction
I'm running
```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
  -e HUGGING_FACE_HUB_TOKEN='mytoken' \
  ghcr.io/huggingface/text-generation-inference:0.9 \
  --model-id $model --num-shard $num_shard \
  --max-input-length 4000 --max-total-tokens 4096
```
on 8 shards for my 8 A100 GPUs. I'm running this on the latest commit: 1da642bd0e6de28ef499f17cd226264f3ccdc824
Expected behavior
I'm running this command with a valid Hugging Face token and after downloading all the model weights, but the shards wait for a long time without ever connecting to the model. I've made sure all 8 GPUs are free, yet it still doesn't connect to the server after a long time. I would appreciate any help or optimized approaches to get this running!