huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

huggingface_hub.utils._errors.HfHubHTTPError: 504 Server Error: Gateway Time-out for url #2375

cs-mshah commented 4 months ago

Describe the bug

I am trying to upload a large dataset to HF, but I frequently encounter timeouts with the following error:

    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 504 Server Error: Gateway Time-out for url: <url>

Reproduction

After a lot of searching, I put together the following script to upload a large dataset (~100 GB) to Hugging Face Datasets using the huggingface_hub API:

import autoroot
import os
from huggingface_hub import HfApi, CommitOperationAdd, HfFileSystem, preupload_lfs_files
from pathlib import Path
from loguru import logger as log
import argparse
import multiprocessing

# Both clients read the token from a `token` environment variable.
api = HfApi(token=os.environ["token"])
fs = HfFileSystem(token=os.environ["token"])

def get_all_files(root: Path, include_patterns=None, ignore_patterns=None):
    # Avoid mutable default arguments.
    include_patterns = include_patterns or []
    ignore_patterns = ignore_patterns or []

    def is_ignored(path):
        return any(pattern in str(path) for pattern in ignore_patterns)

    def is_included(path):
        # An empty include list means "include everything".
        if not include_patterns:
            return True
        return any(pattern in str(path) for pattern in include_patterns)

    # Walk the tree iteratively, yielding files that pass both filters.
    dirs = [root]
    while dirs:
        directory = dirs.pop()
        for candidate in directory.iterdir():
            if candidate.is_file() and not is_ignored(candidate) and is_included(candidate):
                yield candidate
            if candidate.is_dir():
                dirs.append(candidate)

def get_groups_of_n(n: int, iterator):
    # Yield lists of at most `n` items from `iterator`.
    assert n >= 1
    buffer = []
    for elt in iterator:
        buffer.append(elt)
        if len(buffer) == n:
            yield buffer
            buffer = []
    if buffer:
        yield buffer

def main(args):
    if args.operation == "upload":
        # List the .hdf5 files already on the Hub and add them to the ignore
        # patterns, so re-running the script resumes instead of re-uploading.
        remote_root = Path(os.path.join("datasets", args.repo_id))
        all_remote_files = fs.glob(os.path.join("datasets", args.repo_id, "**/*.hdf5"))
        all_remote_files = [
            str(Path(file).relative_to(remote_root)) for file in all_remote_files
        ]
        args.ignore_patterns.extend(all_remote_files)

        root = Path(args.root_directory)
        num_threads = args.num_threads
        if num_threads is None:
            num_threads = multiprocessing.cpu_count()
        for i, file_paths in enumerate(get_groups_of_n(args.group_size, get_all_files(root, args.include_patterns, args.ignore_patterns))):
            log.info(f"Committing {len(file_paths)} files...")
            # path_in_repo is the path of file_path relative to relative_root
            operations = []  # `CommitOperationAdd` objects for this commit
            for file_path in file_paths:
                addition = CommitOperationAdd(
                    path_in_repo=str(file_path.relative_to(Path(args.relative_root))),
                    path_or_fileobj=str(file_path),
                )
                # Pre-upload the LFS blob right away so the final
                # create_commit only has to record the file references.
                preupload_lfs_files(
                    args.repo_id,
                    [addition],
                    token=os.environ["token"],
                    num_threads=num_threads,
                    repo_type="dataset",
                )
                operations.append(addition)

            commit_info = api.create_commit(
                repo_id=args.repo_id,
                operations=operations,
                commit_message=f"Upload part {i}",
                repo_type="dataset",
                token=os.environ["token"],
                num_threads=num_threads
            )
            log.info(f"Commit {i} done: {commit_info.commit_message}")

    elif args.operation == "delete":
        api.delete_folder(args.path_in_repo, 
                          repo_id=args.repo_id, 
                          repo_type="dataset", 
                          commit_description="Delete old folder", 
                          token=os.environ["token"])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--operation", type=str, default="upload", choices=["upload", "delete"])
    parser.add_argument("--group_size", type=int, default=100)
    parser.add_argument("--repo_id", type=str)
    parser.add_argument(
        "--relative_root",
        type=str,
        help="Local root that path_in_repo is computed relative to.",
    )
    parser.add_argument("--root_directory", type=str, help="Root directory to upload (or delete).")
    parser.add_argument("--path_in_repo", type=str, help="Path in the repo to delete")
    parser.add_argument("--ignore_patterns", help="Patterns to ignore", nargs="+", default=["spurious", "resources"])
    parser.add_argument("--include_patterns", help="Patterns to include", nargs="+", default=["hdf5", "csv"])
    parser.add_argument("--num_threads", type=int, default=None, help="Number of threads to use for uploading.")
    args = parser.parse_args()
    main(args)

Logs

No response

System info

- huggingface_hub version: 0.23.4
- Platform: Linux-6.5.0-41-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /data/.cache/token
- Has saved token ?: True
- Who am I ?: cs-mshah
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.3.1
- Jinja2: 3.0.3
- Graphviz: N/A
- keras: 2.14.0
- Pydot: N/A
- Pillow: 9.5.0
- hf_transfer: 0.1.6
- gradio: 3.50.0
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: 2.7.4
- aiohttp: 3.9.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /data/.cache/hub
- HF_ASSETS_CACHE: /data/.cache/assets
- HF_TOKEN_PATH: /data/.cache/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

I have separately added this to the `.env` file:

HF_HUB_ENABLE_HF_TRANSFER=1
HF_HUB_ETAG_TIMEOUT=500
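
I could also wrap each create_commit call in a retry loop with exponential backoff (a hypothetical helper, not part of huggingface_hub; the retry counts below are arbitrary), but that only works around individual failures rather than the timeouts themselves:

import time

from huggingface_hub.utils import HfHubHTTPError
from loguru import logger as log

def create_commit_with_retries(api, max_retries=5, **commit_kwargs):
    # Hypothetical helper: retry `create_commit` on transient HTTP errors
    # such as 504s, backing off exponentially between attempts.
    for attempt in range(max_retries):
        try:
            return api.create_commit(**commit_kwargs)
        except HfHubHTTPError as e:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt
            log.warning(f"Commit failed ({e}); retrying in {wait}s...")
            time.sleep(wait)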
Wauplin commented 4 months ago

Hi @cs-mshah, for uploading very large folders to the Hub, you might want to have a look at https://github.com/huggingface/huggingface_hub/pull/2254. It's not merged yet but is starting to mature. It adds an upload method with advanced retry mechanisms that should help you.
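
Once it lands, usage should look roughly like this (a minimal sketch; the method name and parameters may still change before the PR is merged, and the repo id and paths below are placeholders):

from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # placeholder token

# Proposed in the PR above: upload a whole folder in resumable chunks,
# with built-in retries; the call can be safely re-run after a failure.
api.upload_large_folder(
    repo_id="cs-mshah/my-dataset",  # placeholder repo id
    repo_type="dataset",
    folder_path="/path/to/local/dataset",  # placeholder path
)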