Closed rickvanveen closed 3 months ago
It downloads the model and converts it to safetensors. Depending on your disk speed (and RAM + swap), this can take a while.
safetensors is required for sharding, and to ease maintenance the conversion is done by default. If you're using a cache, this step only happens once, though.
And you should have some logs about what is happening.
You could try -e HF_HUB_ENABLE_HF_TRANSFER=0
to disable the fast download and see what's happening during the download.
I waited over the weekend; the pod was online for 3 days. I cannot imagine it would take that long.
Thank you for the suggestion. I added it like so:
env:
  - name: HF_HUB_ENABLE_HF_TRANSFER
    value: "0"
to my deployment. I'm not getting any additional output in the logs.
I have to correct my earlier statement that it wasn't working with version 0.9. It works with 0.9.3 and 0.9.4, but not with latest, 1.0.1, or 1.0.2. Another twist: I tried this with a smaller model, "google/flan-t5-small", and it worked, but the falcon models still have the same issue.
What are the IO throughputs of the disk associated with your PVC storageclass?
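For a rough number to answer this with, here is a minimal Python sketch that times a sequential write into a directory on the volume. It is only an approximation under stated assumptions: the function name `measure_write_throughput` and the default sizes are made up for illustration, it measures sequential writes only, and a dedicated tool such as fio would give a more complete picture.

```python
import os
import sys
import tempfile
import time


def measure_write_throughput(directory, size_mb=64, block_kb=1024):
    """Write about size_mb of random data into `directory`; return MB/s.

    Hypothetical helper for illustration, not part of text-generation-inference.
    """
    block = os.urandom(block_kb * 1024)
    blocks = max(1, (size_mb * 1024) // block_kb)
    # delete=False so the file can be removed explicitly after it is closed.
    tmp = tempfile.NamedTemporaryFile(dir=directory, delete=False)
    try:
        start = time.perf_counter()
        for _ in range(blocks):
            tmp.write(block)
        tmp.flush()
        # fsync so the timing includes flushing to the device,
        # not just filling the page cache.
        os.fsync(tmp.fileno())
        elapsed = time.perf_counter() - start
    finally:
        tmp.close()
        os.remove(tmp.name)
    written_mb = blocks * block_kb / 1024
    return written_mb / elapsed


if __name__ == "__main__":
    # Pass the PVC mount path as the first argument; falls back to the
    # system temp directory when none is given.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    print(f"{measure_write_throughput(target, size_mb=16):.1f} MB/s")
```

Run it inside the pod with the path where the PVC is mounted as the argument. A very low figure would suggest the disk is the bottleneck during the safetensors conversion.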
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
System Info
Using ghcr.io/huggingface/text-generation-inference:latest, but the same issue occurs with 0.9 and 1.0.2. Trying to deploy with model_id "tiiuae/falcon-7b-instruct".
Reproduction
My Kubernetes deployment and PVC config:
After applying this the only logs I get from the container are these:
The container does download a tmpxxxx file the same size as the model file, but then it just stops. The process is still running, though:
Expected behavior
I expect the container to download the model and serve it.