huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Llama 2 70b-hf not loading on server - waiting for shards #676

Closed · arnavsinghvi11 closed this 1 year ago

arnavsinghvi11 commented 1 year ago

System Info

OS Version:
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal

8 A100 GPUs

Information

Tasks

Reproduction

I'm running:

`docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data -e HUGGING_FACE_HUB_TOKEN='mytoken' ghcr.io/huggingface/text-generation-inference:0.9 --model-id $model --num-shard $num_shard --max-input-length 4000 --max-total-tokens 4096`

with 8 shards for my 8 A100 GPUs.

I'm running this on the latest commit: 1da642bd0e6de28ef499f17cd226264f3ccdc824

Expected behavior

I'm running this command with the correct Hugging Face token access, and after downloading all of the model weights, but the shards wait for a long time without ever connecting to the model. I've made sure that all 8 GPUs are free, yet it still doesn't connect to the server even after a long wait. Would appreciate any help or optimized approaches to get this running!
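One quick way to double-check that the GPUs really are idle before launching the container (a sketch using standard `nvidia-smi` query options):

```shell
# Per-GPU memory use and utilization; all 8 should be near zero when idle
nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv
# Any leftover compute processes still holding GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```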

OlivierDehaene commented 1 year ago

What is the speed of the disk where the model is stored? We load the 70B in 50 seconds in our production environment.
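One rough way to answer that is to time a sequential read of one of the downloaded weight files. A sketch, assuming the weights live as `*.safetensors` files somewhere under `$volume`; `iflag=direct` bypasses the page cache so the figure reflects the disk rather than RAM:

```shell
# Pick any large downloaded weight file under the mounted volume
FILE=$(find $volume -name '*.safetensors' -size +1G | head -n 1)
# Read 1 GiB and report throughput, bypassing the page cache
dd if="$FILE" of=/dev/null bs=1M count=1024 iflag=direct
```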

arnavsinghvi11 commented 1 year ago

It actually connected after a few more attempts, but now I run into this issue with the same command as above:

RuntimeError: shape '[-1, 3, 16, 128]' is invalid for input of size 10485760 rank=1
2023-07-21T18:11:21.572989Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}:warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}: text_generation_client: router/client/src/lib.rs:33: Server error: shape '[-1, 3, 16, 128]' is invalid for input of size 10485760
thread 'main' panicked at 'Unable to warmup model: Generation("shape '[-1, 3, 16, 128]' is invalid for input of size 10485760")

Are there any specific parameter constraints for running Llama-2-70b-hf on the server?
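The arithmetic in the trace hints at the cause (an inference from the numbers, not a confirmed diagnosis): the warmup reshape assumes the fused QKV projection splits into 3 equal groups of 16 heads of dimension 128, but Llama 2 70B uses grouped-query attention, with more query heads than key/value heads, so the fused tensor size is not a multiple of 3 × 16 × 128. A quick shell check of the divisibility:

```shell
# The reshape '[-1, 3, 16, 128]' requires the size to be a multiple of 3*16*128
echo $((3 * 16 * 128))      # 6144
echo $((10485760 % 6144))   # 4096 -> non-zero remainder, so the reshape fails
```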

OlivierDehaene commented 1 year ago

You need to use 0.9.3.
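That is, pin the image tag to 0.9.3 rather than 0.9. A sketch of the same command from the original report with only the tag changed:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
  -e HUGGING_FACE_HUB_TOKEN='mytoken' \
  ghcr.io/huggingface/text-generation-inference:0.9.3 \
  --model-id $model --num-shard $num_shard \
  --max-input-length 4000 --max-total-tokens 4096
```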

arnavsinghvi11 commented 1 year ago

Great, that works now! Thanks!

germanjke commented 1 year ago

@OlivierDehaene can you please describe why 0.9.3 works and latest does not?

OlivierDehaene commented 1 year ago

@germanjke can you please describe your issue on latest?

germanjke commented 1 year ago

@OlivierDehaene I have an issue on latest, and it works fine on 0.9.3. I want to know what the difference between them is, thanks.

OlivierDehaene commented 1 year ago

@germanjke can you please describe your issue on latest? For example, with a stack trace?

germanjke commented 1 year ago

@OlivierDehaene

Not enough memory to handle 16000 total tokens with 2 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`

with Llama: RuntimeError: shape '[-1, 3, 8, 128]' is invalid for input of size 5242880

but on 0.9.3 everything is OK
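The shape error here follows the same pattern as the earlier one: 5242880 is not a multiple of 3 × 8 × 128 = 3072 (the remainder is 2048), consistent with a QKV reshape that does not account for Llama 2's grouped-query attention; again, that is a reading of the numbers rather than a confirmed diagnosis.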

OlivierDehaene commented 1 year ago

Try `--pull always` before the image name.
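`docker run --pull always` forces Docker to re-check the registry, so a stale locally cached `latest` image gets replaced with the current one. For example, with the flags from the earlier command:

```shell
docker run --pull always --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $model --num-shard $num_shard
```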