huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Model downloads just *hang* #1186

Closed vgoklani closed 7 months ago

vgoklani commented 10 months ago

System Info

tag=ghcr.io/huggingface/text-generation-inference:1.1.0
model_id=HuggingFaceH4/zephyr-7b-alpha
# model_id=mistralai/Mistral-7B-Instruct-v0.1
validation_workers=2
sharded=false
max_concurrent_requests=4
max_total_tokens=4096
hostname=0.0.0.0
port=3001
gpuDeviceString="device=0"

docker run --net=host -i -t \
    --gpus ${gpuDeviceString} \
    --shm-size 32g \
    --volume ${MODELS_DIRECTORY}/deeplearning/huggingface:/data \
    -e HUGGING_FACE_HUB_TOKEN=${HUGGINGFACE_TOKEN} \
    ${tag} \
    --model-id ${model_id} \
    --validation-workers ${validation_workers} \
    --sharded ${sharded} \
    --max-concurrent-requests ${max_concurrent_requests} \
    --max-total-tokens ${max_total_tokens} \
    --hostname ${hostname} \
    --port ${port}

Information

Tasks

Reproduction

Run the docker script above, and then check the download folder. I can see the first 3 safetensors download successfully, and then it just hangs.

Status: Downloaded newer image for ghcr.io/huggingface/text-generation-inference:1.1.0
2023-10-22T13:01:16.673958Z  INFO text_generation_launcher: Args { model_id: "HuggingFaceH4/zephyr-7b-alpha", revision: None, validation_workers: 2, sharded: Some(false), num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 4, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 4096, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 3001, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-10-22T13:01:16.674057Z  INFO download: text_generation_launcher: Starting download process.
2023-10-22T13:01:21.453462Z  INFO text_generation_launcher: Download file: model-00001-of-00008.safetensors

2023-10-22T13:01:48.297410Z  INFO text_generation_launcher: Downloaded /data/models--HuggingFaceH4--zephyr-7b-alpha/snapshots/5d2b3629ad1799c7e5306e606e0052246098883b/model-00001-of-00008.safetensors in 0:00:26.

2023-10-22T13:01:48.297570Z  INFO text_generation_launcher: Download: [1/8] -- ETA: 0:03:02

2023-10-22T13:01:48.298299Z  INFO text_generation_launcher: Download file: model-00002-of-00008.safetensors

2023-10-22T13:02:45.464068Z  INFO text_generation_launcher: Downloaded /data/models--HuggingFaceH4--zephyr-7b-alpha/snapshots/5d2b3629ad1799c7e5306e606e0052246098883b/model-00002-of-00008.safetensors in 0:00:57.

2023-10-22T13:02:45.464139Z  INFO text_generation_launcher: Download: [2/8] -- ETA: 0:04:12

2023-10-22T13:02:45.464530Z  INFO text_generation_launcher: Download file: model-00003-of-00008.safetensors

2023-10-22T13:03:11.724879Z  INFO text_generation_launcher: Downloaded /data/models--HuggingFaceH4--zephyr-7b-alpha/snapshots/5d2b3629ad1799c7e5306e606e0052246098883b/model-00003-of-00008.safetensors in 0:00:26.

2023-10-22T13:03:11.724954Z  INFO text_generation_launcher: Download: [3/8] -- ETA: 0:03:03.333335

2023-10-22T13:03:11.725362Z  INFO text_generation_launcher: Download file: model-00004-of-00008.safetensors

When I switched the model_id to mistralai/Mistral-7B-Instruct-v0.1 the first tensor gets downloaded and then the process hangs.

Expected behavior

I expect this to just work out of the box, since it's using docker, and there is literally nothing for me to screw up ;)

vgoklani commented 10 months ago

To expand on this further, I restarted the download for mistralai/Mistral-7B-Instruct-v0.1 and it created a new tempfile and made it to 9.3G before hanging. There are no logs, and no errors, which makes it impossible to debug. Moreover, I'm happy to manually download the files via wget (which is both transparent and fast) but the download structure is very confusing. Where do I put the downloaded pytorch tensors? Is it possible to run the tokenizer client-side and just pass in the input_ids?

OlivierDehaene commented 10 months ago

Are you sure you have enough space in the path linked to ${MODELS_DIRECTORY} env var?

vgoklani commented 10 months ago

@OlivierDehaene There are no space issues (over 1T free). Is it possible to enable logging so we can see the error?

vgoklani commented 10 months ago

The downloader is definitely flaky...

When downloading teknium/OpenHermes-2-Mistral-7B, which is split into two files, the first file downloaded immediately, but the second file hung at 4.3G.

We then killed and restarted the docker process; of course it re-downloaded the first file, then another copy of the second file, and then hung again at 4.3G.

We then deleted the tmp files and re-ran the download, and it worked correctly...

This morning we tried downloading HuggingFaceH4/zephyr-7b-beta and got stuck on the 4th file.

We disabled HF_HUB_ENABLE_HF_TRANSFER:

-e HF_HUB_ENABLE_HF_TRANSFER="false" \

but still don't see any logging. It's impossible for us to debug this when we can't see the errors.

Please help.

StephennFernandes commented 10 months ago

@vgoklani hey, were you able to fix this? I'm facing a similar issue.

vgoklani commented 10 months ago

@StephennFernandes Something is definitely flaky; unfortunately we can't see the logging.

One thing to do is disable the HF_HUB_ENABLE_HF_TRANSFER:

-e HF_HUB_ENABLE_HF_TRANSFER="false" \

and then when the downloads get stuck, kill the docker process, delete the tmp files and just restart. Do this several times, cross your fingers, and hope it eventually finishes successfully.
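For reference, a rough sketch of the cleanup we do between retries (the cache path is a placeholder for whatever host directory is mounted as /data):

import shutil
from pathlib import Path

# Placeholder path: whatever host directory is mounted as /data in the container.
cache = Path("/path/to/huggingface-cache")

# Partial downloads are left behind as tmp* entries at the top of the cache volume;
# remove them so the next attempt starts from a clean slate.
for tmp in cache.glob("tmp*"):
    print("removing", tmp)
    if tmp.is_dir():
        shutil.rmtree(tmp)
    else:
        tmp.unlink()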

The problem is most likely server-side, but unfortunately it's hard to be sure since we can't see the logs.

vgoklani commented 10 months ago

Any updates on this, @OlivierDehaene?

I know your team is busy, but we're trying to download this:

TheBloke/deepseek-coder-33B-instruct-AWQ

the first file downloads successfully into the temp directory, we see 6.4G, and then nothing happens. The second file does not start downloading, and there are no updates on the console. We then kill the download and restart; it re-downloads the first file into a new temp folder and then just hangs again.

Could you please help?

We removed the AWQ file and went to the full 33B model, but the same thing keeps happening.

babeal commented 9 months ago

We are seeing the same issue with vilsonrodrigues/falcon-7b-instruct-sharded. You can see where it's hanging with the following commands:

docker exec -it <container-id> /bin/bash
ps -ef

The process output

root         1     0  0 15:58 ?        00:00:00 text-generation-launcher --model-id vilsonrodrigues/falcon-7b-instruct-sharded --num-shard 1 --quantize bitsandbytes
root         8     1  3 15:58 ?        00:00:43 /opt/conda/bin/python3.9 /opt/conda/bin/text-generation-server download-weights vilsonrodrigues/falcon-7b-instruct-sharded --extension .safetensors --logger-level INFO --json-output

Then run strace to see the last output of the download-weights command

apt-get update && apt-get install strace -y && strace -p 8

Always hangs at the exact same spot

write(2, "\rDownloading (\342\200\246)of-00015.safet"..., 97

Update: this occurs with HF_HUB_ENABLE_HF_TRANSFER: "0"; testing with 1, it has completed the download a few times, so still testing.

VfBfoerst commented 9 months ago

Cheers, are you all behind a corporate proxy? We are seeing the same problem (behind a corporate proxy); maybe the proxy is the problem.

vgoklani commented 9 months ago

nope, tested in multiple environments!

erfanium commented 9 months ago

Same here, no activity from the docker container. This is my docker-compose.yml file for the TheBloke/deepseek-coder-1.3B-base-AWQ model:

version: "3.8"
services:
  hf-inference:
    restart: always
    image: ghcr.io/huggingface/text-generation-inference:1.1.1
    command: --model-id TheBloke/deepseek-coder-1.3B-base-AWQ
    volumes:
      - "./data:/data"
    ports:
      - 8080:80
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

The model has already been downloaded into a temp file (854MB of 895MB downloaded).

erfanium commented 9 months ago

@VfBfoerst No, I use plain internet without proxy, and I got the same issue

poojitharamachandra commented 9 months ago

hi, any solution yet?

Narsil commented 9 months ago

We have never been able to reproduce this.

Can you provide more information about the environment you are running in? Also run BOTH

HF_HUB_ENABLE_HF_TRANSFER=0 AND =1 (in the docker environment).

It really looks like some kind of environment issue, but we really cannot help if we cannot reproduce.

erfanium commented 9 months ago

@Narsil This problem originates from the huggingface-hub downloader module. I've tried using both HF_HUB_ENABLE_HF_TRANSFER=0 and HF_HUB_ENABLE_HF_TRANSFER=1, but the issue persists in both cases.

Reproducing this issue is quite challenging and the problem doesn't always occur (especially if you have a high-quality internet connection). I think you could start by testing the downloader with an unstable internet connection.

vgoklani commented 9 months ago

The downloads are indeed unstable, sometimes they work and sometimes they don't :)

We don't have any issues using huggingface_hub.snapshot_download and our data-center has a very stable internet connection :)
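For reference, a minimal sketch of the kind of pre-download we do with snapshot_download (the cache path is a placeholder for whatever gets mounted as /data; the allow_patterns filter is our own choice):

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="HuggingFaceH4/zephyr-7b-alpha",
    cache_dir="/path/to/huggingface-cache",  # mounted as /data in the TGI container
    allow_patterns=["*.safetensors", "*.json", "*.model"],  # skip duplicate .bin weights
)

With the weights already in the cache, TGI's download step just picks them up instead of downloading again.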

Narsil commented 9 months ago

We deploy many models per day without any hitch, ever. I don't doubt you are seeing issues, but from my perspective I cannot see the problem therefore I cannot help.

Everything is using hf_hub_download, not snapshot_download, although it shouldn't make any difference. There can technically be an issue with hf_transfer if it tries to open too many file descriptors for your machine, but deactivating it doesn't seem to help, so you're using raw Python at that point.

I'm sorry but I really cannot help with the problems in your environment. At this point just try to isolate the problem so we can actually help.

Launching text-generation-server download-weights directly, for instance, will only start that process, which could help in figuring it out. Please deactivate HF_TRANSFER so we stay in pure Python and you can see which part of requests is blocking.

I seriously suspect a bug in the environment: something limiting the file descriptors, some hard CPU rate limiting of the host, some firewall issue.

VfBfoerst commented 9 months ago

Hey, I found a workaround (at least in my case).
I started the container with podman (image: ghcr.io/huggingface/text-generation-inference:1.2), but instead of using the default entrypoint, I overrode it with an interactive bash (--entrypoint /bin/bash).
Then I started the download manually:
text-generation-server download-weights bigcode/starcoder
This download succeeds and downloads all weights as expected; it did not stop after the first weight, which was the case when downloading with text-generation-launcher.
Then, I started the launcher:
text-generation-launcher --model-id bigcode/starcoder

It detected the already downloaded weights and started the webserver. After that, the generation via API worked as expected.

vgoklani commented 8 months ago

I was watching this video on Mixtral where the speaker is discussing TGI and then he says "well there are some issues downloading weights when using TGI" ...

https://youtu.be/EXFbZfp8xCI?si=TEv3dvWgI3hZKwWJ&t=928

He's running this on RunPod, which I believe has a "very stable network"

peterschmidt85 commented 8 months ago

I'm experiencing the same issue on certain networks with MODEL_ID=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ. The model is downloaded 100% but hangs at the end before converting the temp file into the actual model file.

MODEL_ID=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ text-generation-launcher --hostname 0.0.0.0 --port 8000 --trust-remote-code --quantize gptq
2023-12-26T20:32:14.790191Z  INFO text_generation_launcher: Args { model_id: "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), speculate: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-12-26T20:32:14.790275Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ` do not contain malicious code.
2023-12-26T20:32:14.790469Z  INFO download: text_generation_launcher: Starting download process.
2023-12-26T20:32:18.089652Z  INFO text_generation_launcher: Download file: model.safetensors

Hangs forever

du -ha /data

23G     /data/tmppkarr8tk
4.0K    /data/version.txt
4.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/refs/main
8.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/refs
0       /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/.no_exist/0f81ba4680ccd2bce163334b93305d40b9e27b09/adapter_config.json
0       /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/.no_exist/0f81ba4680ccd2bce163334b93305d40b9e27b09/medusa_lm_head.pt
4.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/.no_exist/0f81ba4680ccd2bce163334b93305d40b9e27b09
8.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/.no_exist
4.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/snapshots/0f81ba4680ccd2bce163334b93305d40b9e27b09
8.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/snapshots
4.0K    /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/blobs
32K     /data/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ
0       /data/.locks/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ/5dec37d3dc6c82673e6d65871a7fcd581b165dbabce4eaf0a1fa83d96fd4b213.lock
4.0K    /data/.locks/models--TheBloke--Mixtral-8x7B-Instruct-v0.1-GPTQ
8.0K    /data/.locks
23G     /data

If I invoke text-generation-server download-weights TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ manually, the issue doesn't occur. Also, if I run text-generation-server download-weights manually I actually see the download progress, regardless of whether I enable or disable HF_HUB_ENABLE_HF_TRANSFER. Does that mean downloading is not accelerated?
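As a side note, here is a rough sanity check I use to see whether the accelerated path would even be attempted (run inside the container; this is just my own check, not an official diagnostic):

import importlib.util

from huggingface_hub import constants

# True only if the HF_HUB_ENABLE_HF_TRANSFER env var was set to a truthy value.
print("HF_HUB_ENABLE_HF_TRANSFER:", constants.HF_HUB_ENABLE_HF_TRANSFER)
# hf_transfer also has to be installed for the accelerated path to be used.
print("hf_transfer installed:", importlib.util.find_spec("hf_transfer") is not None)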

BEpresent commented 8 months ago

I was watching this video on Mixtral where the speaker is discussing TGI and then he says "well there are some issues downloading weights when using TGI" ...

https://youtu.be/EXFbZfp8xCI?si=TEv3dvWgI3hZKwWJ&t=928

He's running this on RunPod, which I believe has a "very stable network"

Encountering the same issue while waiting for Mixtral to download without any feedback. It stays stuck in the download phase at Download file: model ... (I waited over an hour and tested on 3 different cloud instances).

josephrocca commented 8 months ago

@Narsil I can reliably reproduce this on Runpod with ghcr.io/huggingface/text-generation-inference:1.3.4 and:

--quantize gptq --model-id TheBloke/goliath-120b-GPTQ --revision gptq-3bit--1g-actorder_True --max-total-tokens 4096

Please give it a try.

Also, like peterschmidt85, I find that the value of the HF_HUB_ENABLE_HF_TRANSFER environment variable has no effect on whether or not logs are shown, both when using Runpod's environment variable settings, and when adding it before text-generation-server download-weights during manual download.

RDearnaley commented 8 months ago

@Narsil I'm also experiencing this issue, for downloading Mixtral-8x7B-Instruct-v0.1 (both the original, and also a duplicate I made on HF), under Kubernetes on Azure using the ghcr.io/huggingface/text-generation-inference:1.3.4 image. Let me know if you want a copy of the Kubernetes yaml files — it's extremely replicable. That model has 19 ~4.9GB .safetensor chunks, the download consistently hangs after either 12, 13, or 14 chunks. I have exec-ed to the pod and confirmed that the downloader process is still running (it eventually times out around 20 minutes later), and that the machine has plenty of disk space. Looking in the downloaded blobs directory, it appears to have all full-sized chunks with no partial chunk, so possibly the hang is during the process of starting the download of the next chunk.

I found https://github.com/huggingface/text-generation-inference/issues/354 from back in May which sounded very similar. A poster on that issue thread claimed to have worked around the issue by passing the option --net=host to Docker. I tried what I believe is the Kubernetes equivalent:

spec:
  template:
    spec:
      hostNetwork: true

without success. I also confirmed that if I exec into the pod and manually run the downloader command:

/opt/conda/bin/python3.10 /opt/conda/bin/text-generation-server download-weights mistralai/Mixtral-8x7B-Instruct-v0.1 --extension .safetensors --logger-level INFO --json-output

then that completes just fine, and the text-generation-loader spawns another weights download, which picks up the downloaded model. So now I have a running pod that I can only restart or scale via a slow manual process, thus defeating the entire point of Kubernetes.

This issue has already been open for 2 1/2 months, and is a blocker, sadly, so I'll be switching us over to using vLLM (yes, I searched, and they don't have a similar issue reported).

OlivierDehaene commented 8 months ago

@RDearnaley I wouldn't say that the issue is "extremely replicable". We have yet to see a reproducible example and we do not have the issue in any of our prod environments, our integration tests or even our dev VMs. It's extremely hard to make progress on this issue in these conditions.

Even in this thread, we have seen numerous talented people trying to reproduce/find out what is happening and nobody has a clue yet besides @peterschmidt85 who found that it might be linked to temporary files.

For example you state:

I also confirmed that if I exec into the pod and manually run the downloader command then that completes just fine.

The launcher runs the same command.

So now I have a running pod that I can only restart or scale via a slow manual process, thus defeating the entire point of Kubernetes.

so I'll be switching us over to using vLLM (yes, I searched, and they don't have a similar issue reported).

Sassy comments do not ever lead to magic fixes; they just lead to open source dev burnout. Be aware that they also use the huggingface_hub lib to download the weights.

RDearnaley commented 7 months ago

@RDearnaley I wouldn't say that the issue is "extremely replicable"

What I meant by:

Let me know if you want a copy of the Kubernetes yaml files — it's extremely replicable.

is that it's extremely replicable in our Azure Kubernetes pod, which is why I was asking if you wanted a copy of our Kubernetes script, since I already understood that you have so far been unable to replicate it.

The launcher runs the same command.

I'm aware of that (I said "manually run the downloader command", perhaps I should have been more explicit and said "copy-paste the downloader command from ps -elf and simply rerun that myself from the command line"), and also of the obvious implication that this is a Heisenbug. FWIW I believe I was running this as root. I did notice that running it manually showed progress bars, which I didn't see in the logs of Kubernetes running it. I was hoping that this might be a clue for you.

I regret that my comment explaining my company's situation of being unable to continue using your product apparently came over as "sassy" to you — that was not my intent. I was simply attempting to express that this is a complete blocker for us, and is unfortunately forcing us to stop using your product. I am hoping that the day-plus of work I put in on trying to help debug this difficult issue in your product, and the results of that which I included above, will be of some help.


josephrocca commented 7 months ago

The sass was definitely uncalled for.

@OlivierDehaene @Narsil Here is a step by step replication which always produces the infinite "hang" for me, and I've tried it probably a dozen times with different machines on Runpod, including Community Cloud and Secure Cloud. This is exactly what I did several times a week ago, and I've just tested these steps again today on two different runpod machines (one in Community, and one in Secure).

  1. Visit https://www.runpod.io/console/gpu-secure-cloud and click the "deploy" button for a 2x RTX 4090 machine.
  2. It'll take you to a page with a "Customize Deployment" button - click it.
  3. Paste ghcr.io/huggingface/text-generation-inference:1.3.4 in place of the Docker Image Name input.
  4. Paste --quantize gptq --model-id TheBloke/goliath-120b-GPTQ --revision gptq-3bit--1g-actorder_True --max-total-tokens 4096 in place of the Docker Command input.
  5. Increase Volume Disk to 200GB
  6. Click "Set Overrides" to save those inputs.
  7. Click "Continue"
  8. Click "Deploy"

Here's a video of that process:

https://github.com/huggingface/text-generation-inference/assets/1167575/2267ee62-7831-4da3-9bfc-d29ebece98ee

Then open the "Pods" page via the button in the sidebar and the machine will be listed. Once it starts up, click the logs button and go to the "Container Logs" tab. You'll see it either:

  1. Hang forever, or:
  2. (Very rarely) hang for a while and then show an error (which hints at using HF_HUB_ENABLE_HF_TRANSFER=0, but that doesn't change it for me), and then retry and hang forever.

So far as I can see this is 100% replicable with the above steps. That is of course not to say that this makes it easy to fix, but I hope this helps the investigation somewhat.

maziyarpanahi commented 7 months ago

I have a simple setup for my personal tests: no K8s, local storage, no proxy, just basic Docker, running TGI with several 70B models and many smaller ones. The downloads generally work smoothly. However, occasionally I encounter a model with one large model.safetensors file. It's not split into smaller files, and while downloading, the speed often drops from 900 MBit/s to 0 after 20-30GB, making it challenging to maintain a single download stream. This results in frequent stops and restarts, creating multiple tmp_ directories each time.

35G     data/tmp656691ko
23G     data/tmp_d9232pz
23G     data/tmpexwlks4h
35G     data/tmpgv2_dt0_
23G     data/tmph7k75wio

These are my attempts to download a model that has one 36GB model.safetensors file. It even died twice after downloading almost 35GB. I don't have any issue downloading even larger models if they have been sharded to a maximum of 10GB per file, for instance.

PS: I forgot to share my personal workarounds. If I download the model via AutoModelForCausalLM, I just copy the files to my TGI data/ volume. If it's my own model, I reshard it by saving it with the max shard size set to 9GB. Either of these solves my problem of downloading one large file.
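A rough sketch of that resharding workaround (the model id and output path are placeholders, and this assumes enough CPU RAM to load the model):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("my-org/my-large-model")  # placeholder repo id
# Re-save the weights in shards of at most ~9 GB each.
model.save_pretrained("./my-large-model-resharded", max_shard_size="9GB")
# Copy the resharded folder into the TGI data/ volume (or push it to the Hub)
# so that no single weights file exceeds ~9 GB.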

martinkozle commented 7 months ago

Maybe this comment isn't very useful as I don't have any new insights to add. I had the same issue where I restarted the container a bunch of times and waited and it never finished downloading. Then I decided to try the suggestion here to change the entrypoint, exec into the container and manually run the command I found in ps -ef, and it worked first time... I currently only have 2 deployments so I can manually work around this, but it is definitely an annoying bug.

lenntt commented 7 months ago

I've been struggling with this issue for a while now, getting very similar errors, while having no problem downloading via text-generation-server download-weights $model.

For me it was the following (facepalm incoming, but I hope it helps others): I blindly copied volume=$PWD/data from the docs, but ./data didn't exist. Changing the volume to an existing location seemed to start the download for me. A clearer error would've been nice.

I still got some unclear hangs, but restarting gets me a bit further, so I'm downloading bit by bit, but at least it gets me started :)

erfanium commented 7 months ago

I'm happy to manually download the files via wget (which is both transparent and fast) but the download structure is very confusing.

I totally agree with this suggestion from @vgoklani. It seems this issue won't be resolved anytime soon, so it would be really helpful to be able to serve models from a pre-downloaded model directory, without having to use huggingface_hub.

mssalvatore commented 7 months ago

The culprit

I think I've tracked this down, and it's not at all what I expected. I've tracked it down to the progress bar in huggingface_hub:

https://github.com/huggingface/huggingface_hub/blob/e80cb600406127a9eb357f1843457a9fc4675e89/src/huggingface_hub/file_download.py#L544-L551

What I did

Like others have mentioned, I ran text-generation-server download-weights manually and did not see any failures. I attempted to use pudb to track down what was failing, but as no failures were encountered that was a dead end.

With the magic of print debugging, I added my own debug statements and wrote them out to a file. Here's a diff of my changes:

diff --git a/src/huggingface_hub/file_download.py b/src/huggingface_hub/file_download.py
index abc82e1..428afe1 100644
--- a/src/huggingface_hub/file_download.py
+++ b/src/huggingface_hub/file_download.py
@@ -18,6 +18,11 @@ from pathlib import Path
 from typing import Any, BinaryIO, Dict, Generator, Literal, Optional, Tuple, Union
 from urllib.parse import quote, urlparse

+def write_debug(msg: str):
+    with open("/data/debug-log.txt", "a") as f:
+        f.write(msg)
+        f.flush()
+
 import requests
 from filelock import FileLock

@@ -375,6 +380,7 @@ def _raise_if_offline_mode_is_enabled(msg: Optional[str] = None):
 def _request_wrapper(
     method: HTTP_METHOD_T, url: str, *, follow_relative_redirects: bool = False, **params
 ) -> requests.Response:
+    write_debug(f"{method}: {url}\n")
     """Wrapper around requests methods to add several features.

     What it does:
@@ -422,7 +428,12 @@ def _request_wrapper(
         return response

     # Perform request and return if status_code is not in the retry list.
-    response = get_session().request(method=method, url=url, **params)
+    write_debug("calling get_session()\n")
+    s = get_session()
+    write_debug("finished get_session()\n")
+    write_debug("Making request\n")
+    response = s.request(method=method, url=url, **params)
+    write_debug("finished request\n")
     hf_raise_for_status(response)
     return response

@@ -460,6 +471,7 @@ def http_get(
                     " available in your environment. Try `pip install hf_transfer`."
                 )

+    #import pudb; pu.db
     initial_headers = headers
     headers = copy.deepcopy(headers) or {}
     if resume_size > 0:
@@ -512,6 +524,7 @@ def http_get(
                     "using `pip install -U hf_transfer`."
                 )
             try:
+                write_debug("before hf_transfer.download()\n")
                 hf_transfer.download(
                     url=url,
                     filename=temp_file.name,
@@ -522,7 +535,9 @@ def http_get(
                     max_retries=5,
                     **({"callback": progress.update} if supports_callback else {}),
                 )
+                write_debug("after hf_transfer.download()\n")
             except Exception as e:
+                write_debug("hf_transfer.download() runtime error\n")
                 raise RuntimeError(
                     "An error occurred while downloading using `hf_transfer`. Consider"
                     " disabling HF_HUB_ENABLE_HF_TRANSFER for better error handling."
@@ -538,14 +553,24 @@ def http_get(
             return
         new_resume_size = resume_size
         try:
+            write_debug("before chunk iter\n")
             for chunk in r.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
                 if chunk:  # filter out keep-alive new chunks
+                    write_debug("got chunk\n")
+                    write_debug(f"len(chunk): {len(chunk)}\n")
+                    write_debug("updating progress\n")
                     progress.update(len(chunk))
+                    write_debug("progress updated\n")
                     temp_file.write(chunk)
+                    write_debug("temp_file.write() returned\n")
                     new_resume_size += len(chunk)
                     # Some data has been downloaded from the server so we reset the number of retries.
                     _nb_retries = 5
+                else:
+                    write_debug("not chunk!!\n")
+            write_debug("after chunk iter\n")
         except (requests.ConnectionError, requests.ReadTimeout) as e:
+            write_debug(f"Chunk iter error {e}\n")
             # If ConnectionError (SSLError) or ReadTimeout happen while streaming data from the server, it is most likely
             # a transient error (network outage?). We log a warning message and try to resume the download a few times
             # before giving up. Tre retry mechanism is basic but should be enough in most cases.
@@ -1457,6 +1482,8 @@ def hf_hub_download(
                 _check_disk_space(expected_size, os.path.dirname(blob_path))
                 if local_dir is not None:
                     _check_disk_space(expected_size, local_dir)
+            else:
+                write_debug("Warning! Expected_size is None\n")

             http_get(
                 url_to_download,

I then ran the modified container with the debugging output and monitored both the temporary file size and the output file. I was able to see that the tiiuae/falcon-7b-instruct model consistently hung. Each time it hung, the last thing printed in my output file was "updating progress". Note that the chunk size was always exactly the same.

Hypothesis

It's my hypothesis that, because the docker container is not displaying stdout, the STDOUT buffer or some other output/tty buffer fills up and, once full, progress bar updates block, causing the download to inexplicably hang.
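A standalone toy reproduction of the suspected mechanism (my own sketch, not TGI code): a child process that keeps writing to a pipe nobody reads blocks as soon as the OS pipe buffer, typically ~64 KiB on Linux, fills up.

import subprocess
import time

# Child emits progress-bar-like lines on stderr, much like tqdm does.
child_code = (
    "import sys\n"
    "for i in range(10_000_000):\n"
    "    sys.stderr.write('a' * 64 + '\\n')\n"
    "    sys.stderr.flush()\n"
)

child = subprocess.Popen(
    ["python3", "-c", child_code],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.PIPE,  # the parent never drains this pipe
)

time.sleep(5)
# By now the child is stuck in write(2, ...), long before its loop finishes,
# which mirrors the strace output reported earlier in this thread.
print("child still running:", child.poll() is None)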

Why this might make sense

  1. It seems to be what my debug logging indicates, consistently.
  2. Commenting out the progress bar update line allows downloads to complete, consistently.
  3. It would explain why when text-generation-server download-weights is run manually, no one can reproduce the issue.
  4. It would explain why users on "very stable networks" can still reproduce the issue (i.e. it's not a networking issue).

Why this might not make sense

  1. If a buffer were a certain size, I would expect it to block at the same time during each download. However, my downloads hung at different points.
  2. I would expect this issue to be more widespread and reproducible than it appears to be if it really had to do with the progress bar.

UPDATE!

To further test this hypothesis, I added the following loop just before the download iterates:

@@ -538,14 +553,29 @@ def http_get(
             return
         new_resume_size = resume_size
         try:
+            write_debug("Trying to fill buffers...")
+            for i in range (0, 10_000_000):
+                write_debug(f"progress: {i}\n")
+                print("a"*64)
+                print("a"*64, file=sys.stderr)
+            write_debug("before chunk iter\n")
             for chunk in r.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
                 if chunk:  # filter out keep-alive new chunks

The program consistently freezes after 1006 iterations of the new loop.

skydiablo commented 7 months ago

Just an idea: try running the container with --log-driver none ... it seems the default logging driver is the json-file one, which has a maximum buffer size. This could be blocking in the end. Maybe some parameters, like running logging in a non-blocking mode or increasing the buffer size, will help? -> --log-opt mode=non-blocking --log-opt max-buffer-size=4m

see: https://docs.docker.com/config/containers/logging/configure/

mssalvatore commented 7 months ago

@skydiablo That looks like it should work, but it doesn't seem to, which casts some doubt on my hypothesis. The documentation for --log-opt mode=non-blocking suggests it should solve the problem if my hypothesis is correct.

I did another experiment to try and confirm my hypothesis. Again, here's my loop:

@@ -538,14 +553,29 @@ def http_get(
             return
         new_resume_size = resume_size
         try:
+            write_debug("Trying to fill buffers...")
+            for i in range (0, 10_000_000):
+                write_debug(f"progress: {i}\n")
+                print("a"*64)
+                print("a"*64, file=sys.stderr)
+            write_debug("before chunk iter\n")
             for chunk in r.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
                 if chunk:  # filter out keep-alive new chunks
  1. I removed the line printing to STDOUT and the behavior is the same (hangs after 1006 loops), so it appears that STDERR is the culprit. This is consistent with the hypothesis because tqdm displays the progress bar to STDERR by default.
  2. I reduced the number of bytes from 64 to 32 and then to 16 to see how many loops would complete. The results are as follows:
bytes per loop    # loops completed
64                1006
32                1981
16                3836

This is the behavior even when --log-driver none or --log-opt mode=non-blocking is used. I really would have expected the non-blocking mode to solve the issue.

Unfortunately, the original issue with an unmodded container does not appear reproducible for me at the present moment. The next time I'm able to reproduce the issue with the original container I'll give these logging options a shot to see if they help.

skydiablo commented 7 months ago

I tried this already and in my case it didn't help :(

mssalvatore commented 7 months ago

Same for me. Changing the log driver or mode does not seem to have an effect. The only thing I can find that's consistent is that the version where I commented out the progress bar update always works, whereas the download hangs more often than not when I allow the progress bar to update.

skydiablo commented 7 months ago

btw, I do not see any progress at all? (see attached screenshot) ???

mssalvatore commented 7 months ago

btw, I do not see any progress at all? (see attached screenshot) ???

Right. The progress bar is being sent to STDERR, which isn't displayed.

skydiablo commented 7 months ago

and there is no way to force it to show, like 2>&1 or something like that?

puppetm4st3r commented 7 months ago

For me, with -e HF_HUB_ENABLE_HF_TRANSFER="true" at least I can download at faster speeds; the hang then occurs on the 2nd or 3rd safetensors file, so I re-run the process until all safetensors files are downloaded. But for models that are not sharded into multiple files you will need a lot of bandwidth (MB/sec).

mssalvatore commented 7 months ago

I submitted https://github.com/huggingface/huggingface_hub/pull/2000 to huggingface_hub. Between now and when the fix makes its way into TGI, you can try running your containers with -e HF_HUB_DISABLE_PROGRESS_BARS=1 -e HF_HUB_ENABLE_HF_TRANSFER=0 and see if that resolves the issue.

OlivierDehaene commented 7 months ago

@mssalvatore it seems that you are right. The size of this buffer also seems to be system dependent, which is why we didn't run into it.

#1486 should fix the issue.

OlivierDehaene commented 7 months ago

Can you guys test ghcr.io/huggingface/text-generation-inference:sha-9c320e2 and tell me if the issue is fixed?

dtpowl commented 7 months ago

It's fixed for me. Thank you!

OlivierDehaene commented 7 months ago

As 1.4 is out with the fix I will close this issue. Thanks a lot to @mssalvatore, your help was essential in fixing this issue and is greatly appreciated.

lwabish commented 3 months ago

Met this issue again with this docker compose file:

version: "3.8"

services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:2.0.3
    volumes:
      - ./data:/data
    restart: "no"
    ports:
      - 8080:80
    command: --model-id teknium/OpenHermes-2.5-Mistral-7B
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]