System Info
TGI version: latest; single NVIDIA GeForce RTX 3090
Information
[X] Docker
[ ] The CLI directly
Tasks
[X] An officially supported command
[ ] My own modifications
Reproduction
The first loading method (loading the Llama 3 8B model from the Hugging Face Hub):
model=meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
-e HF_TOKEN=$token \
-e HF_ENDPOINT="https://hf-mirror.com" \
-e HF_HUB_ENABLE_HF_TRANSFER=False \
-e USE_FLASH_ATTENTION=False \
-e HF_HUB_OFFLINE=1 \
ghcr.chenby.cn/huggingface/text-generation-inference:latest \
--model-id $model
The second loading method (loading the Llama 3 8B model from a local directory):
model=/data/ans_model/meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
-e HF_TOKEN=$token \
-e HF_ENDPOINT="https://hf-mirror.com" \
-e HF_HUB_ENABLE_HF_TRANSFER=False \
-e USE_FLASH_ATTENTION=False \
-e HF_HUB_OFFLINE=1 \
ghcr.chenby.cn/huggingface/text-generation-inference:latest \
--model-id $model
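One difference worth ruling out before anything else (an assumption, not a confirmed cause): the weight format in the local directory. A quick host-side check:

```shell
# Host-side path guessed from the volume mount above
# (-v /home/data/Project/model:/data, model id /data/ans_model/...);
# adjust if your layout differs.
MODEL_DIR=/home/data/Project/model/ans_model/meta-llama/Meta-Llama-3-8B-Instruct

# TGI loads *.safetensors weights directly; *.bin checkpoints go through
# a slower conversion step at startup.
if ls "$MODEL_DIR"/*.safetensors >/dev/null 2>&1; then
  echo "safetensors weights present"
else
  echo "no safetensors weights found:"
  ls "$MODEL_DIR" 2>/dev/null || true
fi
```

If only `*.bin` files are present, that mainly affects startup time rather than steady-state inference, but it confirms the two runs are not loading identical artifacts.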
Expected behavior
Expected: comparable inference speed regardless of how the model is loaded. Observed: the Llama 3 8B model loaded from the Hugging Face Hub runs inference much faster than the same model loaded from the local directory. I don't know why this happens; how can I fix it?
Faster (loaded from the Hugging Face Hub):
Slower (loaded from the local directory):
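To make the speed comparison reproducible rather than anecdotal, a small timing sketch can be run against each container in turn (the port 3002 comes from the commands above; the prompt and token count are placeholder assumptions; `/generate` is TGI's standard REST route):

```python
import json
import time
import urllib.request


def time_generate(url, prompt, max_new_tokens=64):
    """POST a generate request to a TGI server; return (text, seconds elapsed)."""
    payload = json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    elapsed = time.perf_counter() - start
    return body["generated_text"], elapsed


def tokens_per_second(n_tokens, seconds):
    """Rough throughput figure for comparing the two deployments."""
    return n_tokens / seconds if seconds > 0 else float("inf")


# Example usage (requires a running TGI container on port 3002):
# text, secs = time_generate("http://localhost:3002/generate", "Hello")
# print(tokens_per_second(64, secs))
```

Running the same prompt and `max_new_tokens` against both deployments several times, and comparing tokens/second, would turn "faster/slower" into numbers that maintainers can act on.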