huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference

The same model, but different loading methods result in very different inference speeds? #2757

Open hjs2027864933 opened 6 days ago

hjs2027864933 commented 6 days ago

System Info

TGI version: latest; single NVIDIA GeForce RTX 3090

Reproduction

The first loading method (loading the Llama 3 8B model from the Hugging Face Hub):

model=meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
    -e HF_TOKEN=$token \
    -e HF_ENDPOINT="https://hf-mirror.com" \
    -e HF_HUB_ENABLE_HF_TRANSFER=False \
    -e USE_FLASH_ATTENTION=False \
    -e HF_HUB_OFFLINE=1 \
    ghcr.chenby.cn/huggingface/text-generation-inference:latest \
    --model-id $model 
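
To see what configuration the launcher actually resolved in each case, it can help to query TGI's /info endpoint once the container is up (a diagnostic sketch; it assumes the container is reachable on host port 3002 as mapped above):

# Returns the resolved model_id, dtype, device type and batching limits for this instance
curl -s http://localhost:3002/info | python3 -m json.tool

Running this against both containers and diffing the output should show whether they ended up with the same dtype and batch/token limits.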

The second loading method (loading the Llama 3 8B model from a local directory):

model=/data/ans_model/meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
    -e HF_TOKEN=$token \
    -e HF_ENDPOINT="https://hf-mirror.com" \
    -e HF_HUB_ENABLE_HF_TRANSFER=False \
    -e USE_FLASH_ATTENTION=False \
    -e HF_HUB_OFFLINE=1 \
    ghcr.chenby.cn/huggingface/text-generation-inference:latest \
    --model-id $model 
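
For a like-for-like comparison, the same generate request can be sent to each container and timed (a minimal sketch; the prompt, port and max_new_tokens are placeholders, not values taken from the runs above):

# With "details": true the response should include the number of generated tokens,
# so tokens/second can be computed from the wall-clock time reported by `time`
time curl -s http://localhost:3002/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 128, "details": true}}'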

Expected behavior

The inference speed of the Llama 3 8B model loaded from the Hugging Face Hub is much faster than that of the model loaded from the local directory. I don't know why this happens; how can I fix it?

Faster (screenshots: fast, fast_2). Slower (screenshots: small, small_2).
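
One difference worth ruling out (an assumption on my part, not something shown in the commands above) is that the two paths may not contain identical weight files, for example safetensors shards in the Hub cache versus pytorch_model-*.bin files in the local copy, or different revisions of the model. The directory contents inside the container can be compared directly (assuming the default hub cache location /data in the TGI image; the snapshot revision below is a placeholder):

# Weights downloaded from the Hub land in the hub cache layout under /data
ls -lh /data/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/<revision>/
# The locally provided copy mounted into the container
ls -lh /data/ans_model/meta-llama/Meta-Llama-3-8B-Instruct/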

hjs2027864933 commented 6 days ago

Looking forward to your reply, thank you.