huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Issue Downloading Certain Datasets After Setting Custom `HF_ENDPOINT` #6728

Closed padeoe closed 8 months ago

padeoe commented 8 months ago

Describe the bug

This bug is triggered when a custom `HF_ENDPOINT` is set and certain datasets, such as `bookcorpus`, are loaded with the package versions listed below.

Steps to reproduce the bug

The issue can be reproduced with the following steps:

  1. Install specific versions of datasets and huggingface_hub.
    pip install datasets==2.18.0
    pip install huggingface_hub==0.21.4
  2. Execute the following Python code.
    import os
    os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
    from datasets import load_dataset
    bookcorpus = load_dataset('bookcorpus', split='train')

    Console output:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
        builder_instance = load_dataset_builder(
      File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 2228, in load_dataset_builder
        dataset_module = dataset_module_factory(
      File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 1879, in dataset_module_factory
        raise e1 from None
      File "/home/padeoe/.local/lib/python3.10/site-packages/datasets/load.py", line 1830, in dataset_module_factory
        with fs.open(f"datasets/{path}/{filename}", "r", encoding="utf-8") as f:
      File "/home/padeoe/.local/lib/python3.10/site-packages/fsspec/spec.py", line 1295, in open
        self.open(
      File "/home/padeoe/.local/lib/python3.10/site-packages/fsspec/spec.py", line 1307, in open
        f = self._open(
      File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 228, in _open
        return HfFileSystemFile(self, path, mode=mode, revision=revision, block_size=block_size, **kwargs)
      File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 615, in __init__
        self.resolved_path = fs.resolve_path(path, revision=revision)
      File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 180, in resolve_path
        repo_and_revision_exist, err = self._repo_and_revision_exist(repo_type, repo_id, revision)
      File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 117, in _repo_and_revision_exist
        self._api.repo_info(repo_id, revision=revision, repo_type=repo_type)
      File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
        return fn(*args, **kwargs)
      File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2413, in repo_info
        return method(
      File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
        return fn(*args, **kwargs)
      File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2286, in dataset_info
        hf_raise_for_status(r)
      File "/home/padeoe/.local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 362, in hf_raise_for_status
        raise HfHubHTTPError(str(e), response=response) from e
    huggingface_hub.utils._errors.HfHubHTTPError: 401 Client Error: Unauthorized for url: https://hf-mirror.com/api/datasets/bookcorpus/bookcorpus.py (Request ID: Root=1-65ee8659-5ab10eec5960c63e71f2bb58;b00bdbea-fd6e-4a74-8fe0-bc4682ae090e)
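
Reading the failing URL in the traceback, the file path `datasets/bookcorpus/bookcorpus.py` appears to have been queried as if `bookcorpus/bookcorpus.py` were a `namespace/name` repo id. The sketch below is only a simplified, standalone illustration of that ambiguity. The helpers `candidate_repo_ids` and `api_url` are hypothetical and are not huggingface_hub code; the `{endpoint}/api/datasets/{repo_id}` layout is assumed from the URL in the error message.

```python
def candidate_repo_ids(path):
    """Hypothetical helper: list the repo ids that a path such as
    'datasets/<a>/<b>' could refer to (namespace/name first, then name only)."""
    parts = path.split("/")
    if parts[0] != "datasets":
        raise ValueError("expected a path starting with 'datasets/'")
    rest = parts[1:]
    candidates = []
    if len(rest) >= 2:
        # Interpretation 1: the first two segments form 'namespace/name'.
        candidates.append("/".join(rest[:2]))
    if rest:
        # Interpretation 2: the first segment alone is the repo id.
        candidates.append(rest[0])
    return candidates


def api_url(endpoint, repo_id):
    """Hypothetical helper: build the dataset-info URL for an endpoint,
    assuming the layout seen in the error message."""
    return f"{endpoint.rstrip('/')}/api/datasets/{repo_id}"


endpoint = "https://hf-mirror.com"
for repo_id in candidate_repo_ids("datasets/bookcorpus/bookcorpus.py"):
    print(api_url(endpoint, repo_id))
```

The first printed URL matches the one in the 401 error above, which suggests the request failed while the `namespace/name` interpretation was being probed.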

Expected behavior

The dataset should download correctly without any errors.

Environment info

datasets==2.18.0
huggingface-hub==0.21.4
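
For anyone comparing their own setup against these versions, a minimal sketch using only the standard library is shown here. The helper name `installed_version` is an illustrative choice, not part of either package.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions relevant to this report.
for name in ("datasets", "huggingface_hub"):
    print(f"{name}=={installed_version(name)}")
```

A package that is not installed is reported as `None` rather than raising an error.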

padeoe commented 8 months ago

Through debugging, I found that a potential solution is to modify the error-handling module of huggingface_hub: https://github.com/huggingface/huggingface_hub/commit/56d6c798c44e83d2a3167e74c022737d8fcbe822

padeoe commented 8 months ago

@Wauplin

Wauplin commented 8 months ago

Thanks for investigating and reporting the bug @padeoe! I've opened a PR in huggingface_hub with your suggested fix! :) https://github.com/huggingface/huggingface_hub/pull/2119