huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

HF Hub gets stuck in infinite loop for uploading datasets from Hetzner storage #2445

Closed: sleepingcat4 closed this issue 1 month ago

sleepingcat4 commented 1 month ago

Describe the bug

A few days earlier, I was able to upload a file from my Ubuntu node, which uses Hetzner as remote storage; it's connected via sshfs. I'm aware that the file will first be moved from the Hetzner storage onto my Ubuntu node and then uploaded to HF.

I have a 1 Gbps transfer rate between the storage and the node, and a gigabit fiber connection on the Ubuntu node. So, in this scenario, the upload shouldn't take more than 40 minutes, since the file was a mere 40 GB.

I was able to do this a few days earlier, but today it suddenly broke and the HF Hub got stuck in an infinite loop. I waited for hours and nothing happened.

What's frustrating is that I wasn't able to understand what was going wrong. When I interrupted the execution, it showed the program trying to chunk my file.

I am not sure what is actually going wrong inside HF Hub and, most importantly, can we have better logs, or a verbose option? That would make it easier to debug the issue and avoid falling into such pitfalls again.

Reproduction

No response

Logs

No response

System info

huggingface_hub: latest version
RAM: 1 TB
OS: Ubuntu
Wauplin commented 1 month ago

I wasn't able to understand what was going wrong. When I interrupted the execution, it showed the program trying to chunk my file.

Could you share the stacktrace of when this was happening?

I am not sure what is actually going wrong inside HF Hub and, most importantly, can we have better logs, or a verbose option?

What are you using to upload your dataset? In the CLI you should see all logs at INFO level or above. If you want more, you can call huggingface_hub.logging.set_verbosity_debug() and then upload_folder in a small script, which should print even more logs. If you have some logs and are still struggling, it would be helpful to share them.
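
For illustration, a minimal version of such a debug script could look like the sketch below; the repo id and folder path are placeholders, not taken from this thread:

```python
from huggingface_hub import logging, upload_folder

# Print DEBUG-level logs from huggingface_hub (hashing, chunking, HTTP calls, ...)
logging.set_verbosity_debug()

upload_folder(
    folder_path="/path/to/local/dataset",   # placeholder: your local folder
    repo_id="your-username/your-dataset",   # placeholder: your dataset repo
    repo_type="dataset",
)
```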

I have a 1 Gbps transfer rate between the storage and the node, and a gigabit fiber connection on the Ubuntu node. So, in this scenario, the upload shouldn't take more than 40 minutes, since the file was a mere 40 GB.

Sometimes it's a bit optimistic to calculate things like this, since you're never sure whether you're using all the available bandwidth, or even whether bandwidth is the limiting factor here. By the way, in the upload process the data is usually read twice (once for hashing, once for uploading). But several hours without anything happening is quite long, I agree.
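
As a rough, illustrative check (not part of the original exchange), you can time a single read-and-hash pass over the file to see whether simply reading it over the sshfs mount is the slow part; LFS payloads are hashed with SHA-256, so something like the following gives a lower bound for the hashing pass. The path is a placeholder:

```python
import hashlib
import time

# Hypothetical path; replace with the actual large file on the sshfs mount.
path = "/mnt/hetzner/dataset.tar"

start = time.time()
h = hashlib.sha256()
with open(path, "rb") as f:
    # Read in 8 MB chunks so memory use stays flat even for very large files.
    for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
        h.update(chunk)
elapsed = time.time() - start

print(f"sha256={h.hexdigest()}  read+hash took {elapsed:.0f}s")
```

If this single pass already takes hours, the bottleneck is the storage/mount rather than the Hub upload itself.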


Regardless of these errors, I'd recommend reading https://huggingface.co/docs/hub/repositories-recommendations, which gives some limits and recommendations for repos on the Hub. A single 40 GB file is usually not recommended: it's harder to upload and it makes the download process more hazardous for end users.

Finally, I can also recommend looking into https://github.com/huggingface/huggingface_hub/pull/2254. It's a tool we've built to upload large folders to the Hub. It's not merged yet, but it is installable from source and has already been heavily tested.
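
As a rough sketch only, and assuming the PR's entry point is an upload_large_folder method on HfApi (check the PR itself for the exact API, which may differ), usage would look something like this, with a placeholder repo id and path:

```python
from huggingface_hub import HfApi

api = HfApi()

# Resumable, multi-worker upload of a large local folder to a dataset repo.
api.upload_large_folder(
    repo_id="your-username/your-dataset",   # placeholder
    folder_path="/path/to/local/dataset",   # placeholder
    repo_type="dataset",
)
```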

sleepingcat4 commented 1 month ago

Could you share the stacktrace of when this was happening?

Unfortunately I can't, since I didn't capture it when the initial error happened during the upload. But from what I remember, the infinite loop was busy chunking the dataset into small data packets, either to hash them or just to chunk them before the hashing step.

What are you using to upload your dataset? In the CLI you should see all logs at INFO level or above. If you want more, you can call huggingface_hub.logging.set_verbosity_debug() and then upload_folder in a small script, which should print even more logs. If you have some logs and are still struggling, it would be helpful to share them.

Thanks! I will keep this in mind next time.

Sometimes it's a bit optimistic to calculate things like this, since you're never sure whether you're using all the available bandwidth, or even whether bandwidth is the limiting factor here. By the way, in the upload process the data is usually read twice (once for hashing, once for uploading). But several hours without anything happening is quite long, I agree.

The exceptionally long time taken by the HF Hub was the concerning factor for me; that led me to raise this issue here. Even in the worst-case scenario, assuming it was hashing 100 GB, it shouldn't take 6-8 hours. I believe I have some of the fastest connections and machines available, since I was running this on an Intel cluster.

Regardless of these errors, I'd recommend reading https://huggingface.co/docs/hub/repositories-recommendations, which gives some limits and recommendations for repos on the Hub. A single 40 GB file is usually not recommended: it's harder to upload and it makes the download process more hazardous for end users.

I will keep these recommendations in mind from now on. By the way, I had a question: if I use datasets to convert my dataset into an HF Dataset object, I don't encounter the job/operation failure error for the Parquet file viewer. But when I upload using HF Hub from the terminal (CLI), I do encounter this job failure error and it takes 21 hours to be fixed. I assume either @severo fixes it or there's some automatic fix that happens behind the scenes.

Could I get a clearer understanding of what goes wrong? I was also interested to know: when I chunk the dataset and upload it automatically using the datasets library, it gets converted into shards and uploaded, but the upload ends up inside a data folder. Is there any chance I can rename this data folder to something else?

Wauplin commented 1 month ago

By the way, I had a question: if I use datasets to convert my dataset into an HF Dataset object, I don't encounter the job/operation failure error for the Parquet file viewer. But when I upload using HF Hub from the terminal (CLI), I do encounter this job failure error and it takes 21 hours to be fixed. I assume either @severo fixes it or there's some automatic fix that happens behind the scenes.

Could I get a clearer understanding of what goes wrong? I was also interested to know: when I chunk the dataset and upload it automatically using the datasets library, it gets converted into shards and uploaded, but the upload ends up inside a data folder. Is there any chance I can rename this data folder to something else?

I think those are multiple questions better suited for the datasets repository, since they are very datasets-related and not just about the upload process. Could you move them to https://github.com/huggingface/datasets (opening a new issue referencing this comment) and close this issue here, if that's ok with you? Thanks!