Closed — arnavsinghvi11 closed this issue 1 year ago
What is the speed of your disk where the model is stored? We load the 70B in 50 secs in our prod.
It actually connected after a few more attempts, but now I run into this issue with the same command as above:
```
RuntimeError: shape '[-1, 3, 16, 128]' is invalid for input of size 10485760
rank=1 2023-07-21T18:11:21.572989Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}:warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}: text_generation_client: router/client/src/lib.rs:33: Server error: shape '[-1, 3, 16, 128]' is invalid for input of size 10485760
thread 'main' panicked at 'Unable to warmup model: Generation("shape '[-1, 3, 16, 128]' is invalid for input of size 10485760")'
```
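As a side note, the arithmetic in the error is easy to check. A `view` with one inferred dimension (`-1`) only succeeds when the total element count is divisible by the product of the fixed dimensions, and here it is not — which usually means the weights were sharded for a different head count or shard count than the server expects. A minimal check (the interpretation of the dimensions as qkv × heads-per-shard × head_dim is my assumption, not from the log):

```python
# First report: tensor of 10485760 elements viewed as [-1, 3, 16, 128].
numel = 10_485_760
fixed_dims = 3 * 16 * 128  # assumed: qkv * heads_per_shard * head_dim = 6144
print(numel % fixed_dims)  # 4096 -> not divisible, so the reshape must fail

# Second report (8 shards): 5242880 elements viewed as [-1, 3, 8, 128].
print(5_242_880 % (3 * 8 * 128))  # 2048 -> same failure mode
```

Since the remainder is non-zero in both cases, PyTorch raises exactly the `RuntimeError: shape '...' is invalid for input of size ...` seen in the log.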
Are there any specific parameter constraints for running llama2-70b-hf on the server?
You need to use 0.9.3.
Great that works now! thanks!
@OlivierDehaene can you please describe why 0.9.3 works and latest does not?
@germanjke can you describe please your issue on latest?
@OlivierDehaene I have an issue on latest, and it works fine on 0.9.3. I want to know what the difference between them is, thanks~
@germanjke can you describe please your issue on latest? For example, a stack trace?
@OlivierDehaene
```
Not enough memory to handle 16000 total tokens with 2 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`
```
and with llama:
```
RuntimeError: shape '[-1, 3, 8, 128]' is invalid for input of size 5242880
```
but in 0.9.3 everything is ok
Try adding `--pull always` before the image name.
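For reference, a sketch of where the flag goes (the image tag, volume, and variables are taken from the command later in this thread; the exact values are assumptions):

```shell
# --pull always is a docker run option, so it must appear before the image
# name; everything after the image is passed to text-generation-inference.
# It forces Docker to re-check the registry instead of reusing a stale
# locally cached image for the same tag.
docker run --pull always --gpus all --shm-size 1g \
  -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:0.9.3 \
  --model-id $model --num-shard $num_shard
```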
System Info
OS Version: Distributor ID: Ubuntu Description: Ubuntu 20.04.3 LTS Release: 20.04 Codename: focal
8 A100 GPUs
Information
Tasks
Reproduction
I'm running
```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
  -e HUGGING_FACE_HUB_TOKEN='mytoken' \
  ghcr.io/huggingface/text-generation-inference:0.9 \
  --model-id $model --num-shard $num_shard \
  --max-input-length 4000 --max-total-tokens 4096
```
on 8 shards for my 8 A100 GPUs. I'm running this on the latest commit: 1da642bd0e6de28ef499f17cd226264f3ccdc824
Expected behavior
I'm running this command with a valid Hugging Face token and after downloading all the model weights, but the shards wait for a long time without ever connecting to the model. I've made sure all 8 GPUs are free, yet it still doesn't connect to the server after a long time. I would appreciate any help or optimized approaches to get this running!