NielsRogge closed this issue 3 weeks ago
Hi @NielsRogge, thanks for reaching out. Yes, we do plan to release the dataset and checkpoints to the hub; I'm working on the release. We also plan to release a v2 version of the dataset, built on top of the v1 version used in the original paper, once it's done.
Hi @NielsRogge, since the dataset is huge, I am trying to upload it with the following script, adapted from some huggingface_hub GitHub issues:
import autoroot  # sets the project root (required by the project layout)
import os
from huggingface_hub import HfApi, CommitOperationAdd, HfFileSystem, preupload_lfs_files
from pathlib import Path
from loguru import logger as log
import argparse
import multiprocessing

api = HfApi(token=os.environ["token"])
fs = HfFileSystem(token=os.environ["token"])


def get_all_files(root: Path, include_patterns=[], ignore_patterns=[]):
    def is_ignored(path):
        for pattern in ignore_patterns:
            if pattern in str(path):
                return True
        return False

    def is_included(path):
        for pattern in include_patterns:
            if pattern in str(path):
                return True
        if len(include_patterns) == 0:
            return True
        return False

    dirs = [root]
    while len(dirs) > 0:
        dir = dirs.pop()
        for candidate in dir.iterdir():
            if candidate.is_file() and not is_ignored(candidate) and is_included(candidate):
                yield candidate
            if candidate.is_dir():
                dirs.append(candidate)


def get_groups_of_n(n: int, iterator):
    assert n > 1
    buffer = []
    for elt in iterator:
        if len(buffer) == n:
            yield buffer
            buffer = []
        buffer.append(elt)
    if len(buffer) != 0:
        yield buffer


def main(args):
    if args.operation == "upload":
        # api.create_tag(repo_id=args.repo_id, tag=args.revision, revision="main", exist_ok=True)
        remote_root = Path(os.path.join("datasets", args.repo_id))
        all_remote_files = fs.glob(os.path.join("datasets", args.repo_id, "**/*.hdf5"))
        all_remote_files = [
            str(Path(file).relative_to(remote_root)) for file in all_remote_files
        ]
        log.info(f"Found {len(all_remote_files)} remote files")
        # skip files that are already present on the Hub
        args.ignore_patterns.extend(all_remote_files)
        root = Path(args.root_directory)
        num_threads = args.num_threads
        if num_threads is None:
            num_threads = multiprocessing.cpu_count()
        for i, file_paths in enumerate(get_groups_of_n(args.group_size, get_all_files(root, args.include_patterns, args.ignore_patterns))):
            log.info(f"Committing {len(file_paths)} files...")
            # path_in_repo is the path of file_path relative to relative_root
            operations = []  # list of all `CommitOperationAdd` objects that will be committed
            for file_path in file_paths:
                addition = CommitOperationAdd(
                    path_in_repo=str(file_path.relative_to(Path(args.relative_root))),
                    path_or_fileobj=str(file_path),
                )
                preupload_lfs_files(
                    args.repo_id,
                    [addition],
                    token=os.environ["token"],
                    num_threads=num_threads,
                    repo_type="dataset",
                    revision=args.revision,
                )
                operations.append(addition)
            commit_info = api.create_commit(
                repo_id=args.repo_id,
                operations=operations,
                commit_message=f"Upload part {i}",
                repo_type="dataset",
                token=os.environ["token"],
                num_threads=num_threads,
                revision=args.revision,
            )
            log.info(f"Commit {i} done: {commit_info.commit_message}")
    elif args.operation == "delete":
        api.delete_folder(
            args.path_in_repo,
            repo_id=args.repo_id,
            repo_type="dataset",
            commit_description="Delete old folder",
            token=os.environ["token"],
        )


if __name__ == "__main__":
    # python upload.py --root_directory /data/manan/data/objaverse/blenderproc/abo_rendered_data --num_threads 8
    parser = argparse.ArgumentParser()
    parser.add_argument("--operation", type=str, default="upload", choices=["upload", "delete"])
    parser.add_argument("--group_size", type=int, default=100)
    parser.add_argument("--repo_id", type=str, default="cs-mshah/SynMirror")
    parser.add_argument(
        "--relative_root",
        type=str,
        default="/data/manan/data/objaverse/blenderproc/",
        help="Relative root",
    )
    parser.add_argument("--revision", type=str, default="main", help="Revision to commit to")
    parser.add_argument("--root_directory", type=str, default="", help="Root directory to upload (or delete).")
    parser.add_argument("--path_in_repo", type=str, default="hf-objaverse-v1", help="Path in the repo to delete")
    parser.add_argument("--ignore_patterns", help="Patterns to ignore", nargs="+", default=["spurious", "resources"])
    parser.add_argument("--include_patterns", help="Patterns to include", nargs="+", default=["hdf5"])
    parser.add_argument("--num_threads", type=int, default=None, help="Number of threads to use for uploading.")
    args = parser.parse_args()
    main(args)
The upload is extremely slow and still gives 504 timeouts. I use a wrapper script that keeps re-running this script until the entire dataset is uploaded. However, I am now getting the following error, which wasn't occurring before:
Traceback (most recent call last):
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
response.raise_for_status()
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/datasets/cs-mshah/SynMirror/commit/main
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ankitd/manan/Reflection-Exploration/BlenderProc/reflection/upload.py", line 120, in <module>
main(args)
File "/home/ankitd/manan/Reflection-Exploration/BlenderProc/reflection/upload.py", line 83, in main
commit_info = api.create_commit(
^^^^^^^^^^^^^^^^^^
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 1398, in _inner
return fn(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3852, in create_commit
hf_raise_for_status(commit_resp, endpoint_name="commit")
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 358, in hf_raise_for_status
raise BadRequestError(message, response=response) from e
huggingface_hub.utils._errors.BadRequestError: (Request ID: Root=1-6708edd7-4242435b32806eb10c9363d9;3eefa13b-a0b7-4773-8201-56543394881c)
Bad request for commit endpoint:
Your push was rejected because it contains files larger than 10 MiB. Please use https://git-lfs.github.com/ to store large files. See also: https://hf.co/docs/hub/repositories-getting-started#terminal Offending files: - .gitattributes (ref: refs/heads/main)
Thanks for the reproducer! I will ping @Wauplin here (maintainer of huggingface_hub)
Hi @cs-mshah, sorry for the non-descriptive error you are getting. I'm not 100% sure why this is happening, but could you try the following instead, skipping the explicit preupload_lfs_files step and passing the operations directly to create_commit:
for i, file_paths in enumerate(get_groups_of_n(args.group_size, get_all_files(root, args.include_patterns, args.ignore_patterns))):
    log.info(f"Committing {len(file_paths)} files...")
    # path_in_repo is the path of file_path relative to relative_root
    operations = [
        CommitOperationAdd(
            path_in_repo=str(file_path.relative_to(Path(args.relative_root))),
            path_or_fileobj=str(file_path),
        )
        for file_path in file_paths
    ]
    commit_info = api.create_commit(
        repo_id=args.repo_id,
        operations=operations,
        commit_message=f"Upload part {i}",
        repo_type="dataset",
        token=os.environ["token"],
        num_threads=num_threads,
        revision=args.revision,
    )
    log.info(f"Commit {i} done: {commit_info.commit_message}")
Btw, if this is to create a CLI that uploads a large dataset stored on disk, I would recommend using huggingface-cli upload-large-folder instead. You can document for your users how to use it with the default values needed for this library. Otherwise, you can also build a dedicated CLI on top of api.upload_large_folder(...). This helper is meant exactly for what you are trying to do, i.e. "I want to upload a very large folder in a robust way". The process is resumable, contrary to the more classic approach.
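For reference, a minimal sketch of what that could look like, assuming the same repo id and local folder as in the script above (the allow_patterns filter and token handling are assumptions, adjust as needed):

import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["token"])

# Resumable, multi-worker upload: the call can be interrupted and re-run,
# and it picks up where it left off instead of starting over.
api.upload_large_folder(
    repo_id="cs-mshah/SynMirror",
    repo_type="dataset",
    folder_path="/data/manan/data/objaverse/blenderproc",
    allow_patterns=["*.hdf5"],  # assumption: only the rendered .hdf5 files are wanted
)

# CLI equivalent:
#   huggingface-cli upload-large-folder cs-mshah/SynMirror /data/manan/data/objaverse/blenderproc --repo-type=dataset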
Thanks @Wauplin. The api.upload_large_folder() helper is new and seems to work. Let's wait and see once the entire dataset gets uploaded.
Yes, quite new indeed! It has already been stress-tested a few times, but happy to get feedback once you use it :hugs:
Seems like it got stuck in the middle. Let me re-run the script.
[Update] The upload has resumed. Let's wait and watch.
@Wauplin I'm constantly receiving gateway timeouts:
Hi @cs-mshah, sorry if it wasn't mentioned before, but there are some repo size recommendations to be aware of when uploading a large dataset: https://huggingface.co/docs/hub/repositories-recommendations. In particular, there is a hard limit of 100,000 files per repo. Please see the guide for solutions on how to mitigate this (basically, pack files together). I'm pretty sure the gateway timeouts you are receiving are due to the number of files and number of commits in your repository, making it very slow to process new commits.
Yes. There are over 5k commits and almost 100,000 files. What should I do? Should I create tars with 1,000 HDF5 files each?
I haven't dug into the file format, but we usually recommend a file format that allows grouping rows (say, Parquet) or, if that's not possible, zipping the files, yes. Unfortunately, this also means you'll have to delete and recreate the repo, as the 100k limitation applies to the full git history.
I see. I'll try zipping and uploading.
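For reference, a minimal sketch of such a packing step, assuming shards of 1,000 .hdf5 files and a tar-per-shard layout (paths and naming here are illustrative, not the exact script that was used):

import tarfile
from pathlib import Path

def pack_hdf5_files(root: Path, out_dir: Path, files_per_tar: int = 1000):
    """Pack groups of .hdf5 files into tar shards to stay well below the
    ~100k-files-per-repo limit. Paths inside each tar stay relative to `root`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(root.rglob("*.hdf5"))
    for start in range(0, len(files), files_per_tar):
        shard = files[start:start + files_per_tar]
        tar_path = out_dir / f"shard_{start // files_per_tar:05d}.tar"
        with tarfile.open(tar_path, "w") as tar:
            for f in shard:
                tar.add(f, arcname=str(f.relative_to(root)))

The resulting shards can then be uploaded with api.upload_large_folder(...) as above, reducing the repo to a few hundred files instead of ~100k.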
Thanks @Wauplin @NielsRogge for your support. Creating tar files and uploading was super fast. You could add links in the API docs to the sections mentioning the file limits and best practices.
Here is the v1 version of the dataset: https://huggingface.co/datasets/cs-mshah/SynMirror
Nice! :fire:
Hi @cs-mshah,
Niels here from the open-source team at Hugging Face. I discovered your work as it was submitted on AK's daily papers: https://huggingface.co/papers/2409.14677. The paper page lets people discuss the paper, and discover its artifacts (such as models, dataset, a demo in the form of a 🤗 Space).
It'd be great to make the checkpoints and dataset available on the 🤗 hub, to improve their discoverability/visibility. We can add tags so that people find them when filtering https://huggingface.co/models and https://huggingface.co/datasets.
Uploading models
See here for a guide: https://huggingface.co/docs/hub/models-uploading.
In this case, we could leverage the PyTorchModelHubMixin class, which adds from_pretrained and push_to_hub to any custom nn.Module. Alternatively, one can leverage the hf_hub_download one-liner to download a checkpoint from the hub.
We encourage researchers to push each model checkpoint to a separate model repository, so that things like download stats also work. We can then also link the checkpoints to the paper page.
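A minimal sketch of the mixin approach (the class, hyperparameters, repo id, and filename below are placeholders, not the actual model from the paper):

import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin, hf_hub_download

class MyModel(nn.Module, PyTorchModelHubMixin):  # placeholder model, just to show the pattern
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.layer = nn.Linear(hidden_dim, hidden_dim)

model = MyModel(hidden_dim=128)
model.push_to_hub("your-username/your-model")                   # uploads weights + config to the Hub
reloaded = MyModel.from_pretrained("your-username/your-model")  # downloads and reinstantiates it

# Or, without the mixin, download a single checkpoint file by name:
ckpt_path = hf_hub_download(repo_id="your-username/your-model", filename="model.safetensors")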
Uploading dataset
Would be awesome to make the dataset available on 🤗, so that people can do:
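from datasets import load_dataset

dataset = load_dataset("your-hf-org-or-username/your-dataset")  # placeholder repo id, to fill in once the dataset is uploaded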
See here for a guide: https://huggingface.co/docs/datasets/image_dataset
Besides that, there's the dataset viewer which allows people to quickly explore the first few rows of the data in the browser.
Let me know if you're interested/need any help regarding this!
Cheers,
Niels
ML Engineer @ HF 🤗