System Info
TGI version: latest; single NVIDIA GeForce RTX 3090
Information
[X] Docker
[ ] The CLI directly
Tasks
[X] An officially supported command
[ ] My own modifications
Reproduction
The first loading method (loading the Llama 3 8B model from the Hugging Face Hub):
model=meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
-e HF_TOKEN=$token \
-e HF_ENDPOINT="https://hf-mirror.com" \
-e HF_HUB_ENABLE_HF_TRANSFER=False \
-e USE_FLASH_ATTENTION=False \
-e HF_HUB_OFFLINE=1 \
ghcr.chenby.cn/huggingface/text-generation-inference:latest \
--model-id $model
The second loading method (loading the Llama 3 8B model from a local directory):
model=/data/ans_model/meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
-e HF_TOKEN=$token \
-e HF_ENDPOINT="https://hf-mirror.com" \
-e HF_HUB_ENABLE_HF_TRANSFER=False \
-e USE_FLASH_ATTENTION=False \
-e HF_HUB_OFFLINE=1 \
ghcr.chenby.cn/huggingface/text-generation-inference:latest \
--model-id $model
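One difference worth ruling out before anything else (an assumption, not a confirmed cause): the weight format in the local directory. A quick host-side check:

```shell
# Host-side path guessed from the volume mount above
# (-v /home/data/Project/model:/data, model id /data/ans_model/...);
# adjust if your layout differs.
MODEL_DIR=/home/data/Project/model/ans_model/meta-llama/Meta-Llama-3-8B-Instruct

# TGI loads *.safetensors weights directly; *.bin checkpoints go through
# a slower conversion step at startup.
if ls "$MODEL_DIR"/*.safetensors >/dev/null 2>&1; then
  echo "safetensors weights present"
else
  echo "no safetensors weights found:"
  ls "$MODEL_DIR" 2>/dev/null || true
fi
```

If only `*.bin` files are present, that mainly affects startup time rather than steady-state inference, but it confirms the two runs are not loading identical artifacts.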
Expected behavior
Expected: comparable inference speed regardless of how the model is loaded. Observed: the Llama 3 8B model loaded from the Hugging Face Hub runs inference much faster than the same model loaded from the local directory. I don't know why this happens; how can I fix it?
Faster (loaded from the Hugging Face Hub):
Slower (loaded from the local directory):
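To make the speed comparison reproducible rather than anecdotal, a small timing sketch can be run against each container in turn (the port 3002 comes from the commands above; the prompt and token count are placeholder assumptions; `/generate` is TGI's standard REST route):

```python
import json
import time
import urllib.request


def time_generate(url, prompt, max_new_tokens=64):
    """POST a generate request to a TGI server; return (text, seconds elapsed)."""
    payload = json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    elapsed = time.perf_counter() - start
    return body["generated_text"], elapsed


def tokens_per_second(n_tokens, seconds):
    """Rough throughput figure for comparing the two deployments."""
    return n_tokens / seconds if seconds > 0 else float("inf")


# Example usage (requires a running TGI container on port 3002):
# text, secs = time_generate("http://localhost:3002/generate", "Hello")
# print(tokens_per_second(64, secs))
```

Running the same prompt and `max_new_tokens` against both deployments several times, and comparing tokens/second, would turn "faster/slower" into numbers that maintainers can act on.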