huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Can't download one file of a public huggingface dataset #2361

Open orionw opened 4 days ago

orionw commented 4 days ago

Describe the bug

This HF dataset is public and should require no token. Nearly all of its files can be downloaded, except for a few, for example this one. For some reason this particular file returns a 403 error both in the UI and programmatically (e.g. via snapshot_download).

Reproduction

Click on this link and hit the download button. You will see the 403 error:

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>08DXVKPPDG349JDZ</RequestId>
<HostId>uUp4yoP8FL+EW4dBpOH56XmCaq92O1qGgsojYjGw0S6xH7W54EMIn9DXonBt0uL09a8+1XwF/EA=</HostId>
</Error>
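As an aside, if you are scripting downloads and want to surface this kind of S3-style error cleanly, the XML body above can be parsed with the standard library (a small sketch; the XML literal below is copied from the error shown here):

```python
import xml.etree.ElementTree as ET

# The error body returned by the CDN, as quoted above.
error_xml = """<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>08DXVKPPDG349JDZ</RequestId>
</Error>"""

root = ET.fromstring(error_xml)
print(root.findtext("Code"))     # AccessDenied
print(root.findtext("Message"))  # Access Denied
```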

If you want the link programmatically, you can do:

from huggingface_hub import hf_hub_url

file_url = hf_hub_url(
    repo_id="orionweller/reddit_mds_incremental",
    filename="reddit_0057/shard.00009.mds.zstd",
    repo_type="dataset",
)
print(file_url)

which returns the same error when opened in a browser or fetched with wget.
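For reference, for a dataset repo the URL returned by hf_hub_url follows the Hub's /resolve/ layout. A simplified sketch of that shape (the real helper also handles URL-quoting, custom endpoints, and non-default revisions):

```python
def dataset_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    # Mirrors the Hub's /resolve/ URL layout for dataset repos.
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"

url = dataset_resolve_url(
    "orionweller/reddit_mds_incremental",
    "reddit_0057/shard.00009.mds.zstd",
)
print(url)
```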

Logs

No response

System info

Copy-and-paste the text below in your GitHub issue.

- huggingface_hub version: 0.23.4
- Platform: Linux-5.15.0-112-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/oweller/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: orionweller
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.3.1
- Jinja2: 3.1.4
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.3.0
- hf_transfer: 0.1.6
- gradio: N/A
- tensorboard: N/A
- numpy: 2.0.0
- pydantic: N/A
- aiohttp: 3.9.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/oweller/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/oweller/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/oweller/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
orionw commented 4 days ago

Also doesn't work with git lfs:

Cloning into 'reddit_mds_incremental'...
remote: Enumerating objects: 6177, done.
remote: Counting objects: 100% (6169/6169), done.
remote: Compressing objects: 100% (6169/6169), done.
remote: Total 6177 (delta 542), reused 0 (delta 0), pack-reused 8 (from 1)
Receiving objects: 100% (6177/6177), 2.01 MiB | 8.83 MiB/s, done.
Resolving deltas: 100% (542/542), done.
Updating files: 100% (5386/5386), done.
Downloading reddit_0057/shard.00009.mds.zstd (28 MB)4.60 MiB/s
Error downloading object: reddit_0057/shard.00009.mds.zstd (3b75114): Smudge error: Error downloading reddit_0057/shard.00009.mds.zstd (3b75114ebd17ea1c7acc26ef555acc1a28bc416c1a56735903466d03889bc28d): LFS: Authorization error: https://cdn-lfs-us-1.huggingface.co/repos/c2/af/c2afab1f010a1036b5f939c7cadc832fd26120ed8eb1f0940723c8075e2e41d3/3b75114ebd17ea1c7acc26ef555acc1a28bc416c1a56735903466d03889bc28d?Expires=1719943480&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxOTk0MzQ4MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2MyL2FmL2MyYWZhYjFmMDEwYTEwMzZiNWY5MzljN2NhZGM4MzJmZDI2MTIwZWQ4ZWIxZjA5NDA3MjNjODA3NWUyZTQxZDMvM2I3NTExNGViZDE3ZWExYzdhY2MyNmVmNTU1YWNjMWEyOGJjNDE2YzFhNTY3MzU5MDM0NjZkMDM4ODliYzI4ZCJ9XX0_&Signature=Ldho8oWbJajqph5FRwSDc0Sxz56%7EqAuhWOYXGEiCdTBOmM%7ELmV-lNzvJyAJ5Bmu04gf68NhqK55RMTw-SyH0xroqR1r3MUvJnhQU-UYz%7EqSKD%7EIT92rUQtPrrBT-FHRTxhDE4FHz%7EOLTAiPql9S1eTZMd6ys7rJ6OiL6IQFzk-4EjCv0j0FHGkTPJ1Dd17TWYuZ6utB%7ECz-t2KmTs7vynuq6iaj2-bYldyczHRjsoZVeNUenbmP10qNvMSyvZsfuh23xeyQDq4XfJp4kc4z3hPQ8GBZfQFJWTRphCiw9PBQzhRF6W8kA9HODN1b%7EyzRBgoKlX0xWXW5HxFH3p6uzKQ__&Key-Pair-Id=K24J24Z295AEI9
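One way to sanity-check a pointer/object mismatch like this is to hash the local copy of the shard and compare it against the sha256 OID that git LFS reports in the error (3b75114…). A minimal sketch, assuming the local file path matches the repo path:

```python
import hashlib

def sha256_of(path: str) -> str:
    # Stream the file in 1 MiB chunks so large shards don't load into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the OID in the LFS error above:
# print(sha256_of("reddit_0057/shard.00009.mds.zstd"))
```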
orionw commented 4 days ago

I found issue #865, which seems to be related. However, I still have the file locally, and pushing it again doesn't change anything. See commit e78a00e, where I re-uploaded the file.

I am using:

api.upload_folder(
    folder_path=local_path,
    repo_id=repo_name,
    path_in_repo=local_path,
    repo_type="dataset",
)
Wauplin commented 1 day ago

Hi @orionw, sorry for the inconvenience. It might be related to a temporary config issue on our side. The consistency check that we run on file upload seems to have silently fail. Could you try re-uploading it now? If it still doesn't work, please let me know and we'll manually delete it ourselves. (another solution is to delete + recreate the entire repo on your side but that might take too much bandwidth/time on your side).