NielsRogge closed this issue 3 weeks ago
Hi @NielsRogge, thanks for reaching out. Yes, we do plan to release the dataset and checkpoints to the hub; I'm working on the release. We also plan to release a v2 version of the dataset, built on top of the v1 version used in the original paper, once it's done.
Hi @NielsRogge, since the dataset is huge, I am trying to upload it with the following script, adapted from some huggingface_hub GitHub issues:
import autoroot  # sets the project root (required by the project layout)
import os
from huggingface_hub import HfApi, CommitOperationAdd, HfFileSystem, preupload_lfs_files
from pathlib import Path
from loguru import logger as log
import argparse
import multiprocessing

api = HfApi(token=os.environ["token"])
fs = HfFileSystem(token=os.environ["token"])


def get_all_files(root: Path, include_patterns=[], ignore_patterns=[]):
    def is_ignored(path):
        for pattern in ignore_patterns:
            if pattern in str(path):
                return True
        return False

    def is_included(path):
        for pattern in include_patterns:
            if pattern in str(path):
                return True
        if len(include_patterns) == 0:
            return True
        return False

    dirs = [root]
    while len(dirs) > 0:
        dir = dirs.pop()
        for candidate in dir.iterdir():
            if candidate.is_file() and not is_ignored(candidate) and is_included(candidate):
                yield candidate
            if candidate.is_dir():
                dirs.append(candidate)


def get_groups_of_n(n: int, iterator):
    assert n > 1
    buffer = []
    for elt in iterator:
        if len(buffer) == n:
            yield buffer
            buffer = []
        buffer.append(elt)
    if len(buffer) != 0:
        yield buffer


def main(args):
    if args.operation == "upload":
        # api.create_tag(repo_id=args.repo_id, tag=args.revision, revision="main", exist_ok=True)
        remote_root = Path(os.path.join("datasets", args.repo_id))
        all_remote_files = fs.glob(os.path.join("datasets", args.repo_id, "**/*.hdf5"))
        all_remote_files = [
            str(Path(file).relative_to(remote_root)) for file in all_remote_files
        ]
        log.info(f"Found {len(all_remote_files)} remote files")
        # skip files that are already present on the Hub
        args.ignore_patterns.extend(all_remote_files)
        root = Path(args.root_directory)
        num_threads = args.num_threads
        if num_threads is None:
            num_threads = multiprocessing.cpu_count()
        for i, file_paths in enumerate(get_groups_of_n(args.group_size, get_all_files(root, args.include_patterns, args.ignore_patterns))):
            log.info(f"Committing {len(file_paths)} files...")
            # path_in_repo is the path of file_path relative to relative_root
            operations = []  # list of all `CommitOperationAdd` objects that will be committed
            for file_path in file_paths:
                addition = CommitOperationAdd(
                    path_in_repo=str(file_path.relative_to(Path(args.relative_root))),
                    path_or_fileobj=str(file_path),
                )
                preupload_lfs_files(
                    args.repo_id,
                    [addition],
                    token=os.environ["token"],
                    num_threads=num_threads,
                    repo_type="dataset",
                    revision=args.revision,
                )
                operations.append(addition)
            commit_info = api.create_commit(
                repo_id=args.repo_id,
                operations=operations,
                commit_message=f"Upload part {i}",
                repo_type="dataset",
                token=os.environ["token"],
                num_threads=num_threads,
                revision=args.revision,
            )
            log.info(f"Commit {i} done: {commit_info.commit_message}")
    elif args.operation == "delete":
        api.delete_folder(
            args.path_in_repo,
            repo_id=args.repo_id,
            repo_type="dataset",
            commit_description="Delete old folder",
            token=os.environ["token"],
        )


if __name__ == "__main__":
    # python upload.py --root_directory /data/manan/data/objaverse/blenderproc/abo_rendered_data --num_threads 8
    parser = argparse.ArgumentParser()
    parser.add_argument("--operation", type=str, default="upload", choices=["upload", "delete"])
    parser.add_argument("--group_size", type=int, default=100)
    parser.add_argument("--repo_id", type=str, default="cs-mshah/SynMirror")
    parser.add_argument(
        "--relative_root",
        type=str,
        default="/data/manan/data/objaverse/blenderproc/",
        help="Relative root",
    )
    parser.add_argument("--revision", type=str, default="main", help="Revision to commit to")
    parser.add_argument("--root_directory", type=str, default="", help="Root directory to upload (or delete).")
    parser.add_argument("--path_in_repo", type=str, default="hf-objaverse-v1", help="Path in the repo to delete")
    parser.add_argument("--ignore_patterns", help="Patterns to ignore", nargs="+", default=["spurious", "resources"])
    parser.add_argument("--include_patterns", help="Patterns to include", nargs="+", default=["hdf5"])
    parser.add_argument("--num_threads", type=int, default=None, help="Number of threads to use for uploading.")
    args = parser.parse_args()
    main(args)
The upload is extremely slow and still gives 504 timeouts. I use a wrapper script that keeps re-running this script until the entire dataset is uploaded. However, I am now getting the following error, which wasn't occurring before:
Traceback (most recent call last):
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
response.raise_for_status()
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/datasets/cs-mshah/SynMirror/commit/main
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ankitd/manan/Reflection-Exploration/BlenderProc/reflection/upload.py", line 120, in <module>
main(args)
File "/home/ankitd/manan/Reflection-Exploration/BlenderProc/reflection/upload.py", line 83, in main
commit_info = api.create_commit(
^^^^^^^^^^^^^^^^^^
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 1398, in _inner
return fn(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3852, in create_commit
hf_raise_for_status(commit_resp, endpoint_name="commit")
File "/home/ankitd/miniconda3/envs/brushnet/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 358, in hf_raise_for_status
raise BadRequestError(message, response=response) from e
huggingface_hub.utils._errors.BadRequestError: (Request ID: Root=1-6708edd7-4242435b32806eb10c9363d9;3eefa13b-a0b7-4773-8201-56543394881c)
Bad request for commit endpoint:
Your push was rejected because it contains files larger than 10 MiB. Please use https://git-lfs.github.com/ to store large files. See also: https://hf.co/docs/hub/repositories-getting-started#terminal Offending files: - .gitattributes (ref: refs/heads/main)
Thanks for the reproducer! I will ping @Wauplin here (maintainer of huggingface_hub)
Hi @cs-mshah, sorry for the non-descriptive error you are getting. I'm not 100% sure why this is happening, but could you try the following instead, skipping the explicit preupload_lfs_files step and passing the operations directly to create_commit:
for i, file_paths in enumerate(get_groups_of_n(args.group_size, get_all_files(root, args.include_patterns, args.ignore_patterns))):
    log.info(f"Committing {len(file_paths)} files...")
    # path_in_repo is the path of file_path relative to relative_root
    operations = [
        CommitOperationAdd(
            path_in_repo=str(file_path.relative_to(Path(args.relative_root))),
            path_or_fileobj=str(file_path),
        )
        for file_path in file_paths
    ]
    commit_info = api.create_commit(
        repo_id=args.repo_id,
        operations=operations,
        commit_message=f"Upload part {i}",
        repo_type="dataset",
        token=os.environ["token"],
        num_threads=num_threads,
        revision=args.revision,
    )
    log.info(f"Commit {i} done: {commit_info.commit_message}")
Btw, if this is to create a CLI that uploads a large dataset stored on disk, I would recommend using huggingface-cli upload-large-folder instead. You can document for your users how to use it with the default values needed for this library. Otherwise, you can also build a dedicated CLI on top of api.upload_large_folder(...). This helper is meant exactly for what you are trying to do, i.e. "I want to upload a very large folder in a robust way". The process is resumable, contrary to the more classic approach.
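For reference, a minimal sketch of what that could look like, assuming the same repo id and local folder as in the script above (the allow_patterns filter and token handling are assumptions, adjust as needed):

import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["token"])

# Resumable, multi-worker upload: the call can be interrupted and re-run,
# and it picks up where it left off instead of starting over.
api.upload_large_folder(
    repo_id="cs-mshah/SynMirror",
    repo_type="dataset",
    folder_path="/data/manan/data/objaverse/blenderproc",
    allow_patterns=["*.hdf5"],  # assumption: only the rendered .hdf5 files are wanted
)

# CLI equivalent:
#   huggingface-cli upload-large-folder cs-mshah/SynMirror /data/manan/data/objaverse/blenderproc --repo-type=dataset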
Thanks @Wauplin. The api.upload_large_folder() helper is new and seems to work. Let's wait and see once the entire dataset gets uploaded.
Yes, quite new indeed! It has already been stress-tested a few times, but happy to get feedback once you use it :hugs:
Seems like it got stuck in the middle. Let me re-run the script.
[Update] The upload has resumed. Let's wait and watch.
@Wauplin I'm constantly receiving gateway timeouts:
Hi @cs-mshah, sorry if it wasn't mentioned before, but there are some repo size recommendations to be aware of when uploading a large dataset: https://huggingface.co/docs/hub/repositories-recommendations. In particular, there is a hard limit of 100,000 files per repo. Please see the guide for solutions on how to mitigate this (basically, pack files together). I'm pretty sure the gateway timeouts you are receiving are due to the number of files and number of commits in your repository, making it very slow to process new commits.
Yes. There are over 5k commits and almost 100,000 files. What should I do? Should I create tars with 1,000 HDF5 files each?
I haven't dug into the file format, but we usually recommend a file format that allows grouping rows (say, Parquet) or, if that's not possible, zipping the files, yes. Unfortunately, this also means you'll have to delete and recreate the repo, as the 100k limitation applies to the full git history.
I see. I'll try zipping and uploading.
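For reference, a minimal sketch of such a packing step, assuming shards of 1,000 .hdf5 files and a tar-per-shard layout (paths and naming here are illustrative, not the exact script that was used):

import tarfile
from pathlib import Path

def pack_hdf5_files(root: Path, out_dir: Path, files_per_tar: int = 1000):
    """Pack groups of .hdf5 files into tar shards to stay well below the
    ~100k-files-per-repo limit. Paths inside each tar stay relative to `root`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(root.rglob("*.hdf5"))
    for start in range(0, len(files), files_per_tar):
        shard = files[start:start + files_per_tar]
        tar_path = out_dir / f"shard_{start // files_per_tar:05d}.tar"
        with tarfile.open(tar_path, "w") as tar:
            for f in shard:
                tar.add(f, arcname=str(f.relative_to(root)))

The resulting shards can then be uploaded with api.upload_large_folder(...) as above, reducing the repo to a few hundred files instead of ~100k.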
Thanks @Wauplin @NielsRogge for your support. Creating tar files and uploading was super fast. You could add links in the API docs to the sections mentioning the file limits and best practices.
Here is the v1 version of the dataset: https://huggingface.co/datasets/cs-mshah/SynMirror
Nice! :fire:
Hi @cs-mshah,
Niels here from the open-source team at Hugging Face. I discovered your work as it was submitted on AK's daily papers: https://huggingface.co/papers/2409.14677. The paper page lets people discuss the paper, and discover its artifacts (such as models, dataset, a demo in the form of a 🤗 Space).
It'd be great to make the checkpoints and dataset available on the 🤗 hub, to improve their discoverability/visibility. We can add tags so that people find them when filtering https://huggingface.co/models and https://huggingface.co/datasets.
Uploading models
See here for a guide: https://huggingface.co/docs/hub/models-uploading.
In this case, we could leverage the PyTorchModelHubMixin class, which adds from_pretrained and push_to_hub to any custom nn.Module. Alternatively, one can leverage the hf_hub_download one-liner to download a checkpoint from the hub.
We encourage researchers to push each model checkpoint to a separate model repository, so that things like download stats also work. We can then also link the checkpoints to the paper page.
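A minimal sketch of the mixin approach (the class, hyperparameters, repo id, and filename below are placeholders, not the actual model from the paper):

import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin, hf_hub_download

class MyModel(nn.Module, PyTorchModelHubMixin):  # placeholder model, just to show the pattern
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.layer = nn.Linear(hidden_dim, hidden_dim)

model = MyModel(hidden_dim=128)
model.push_to_hub("your-username/your-model")                   # uploads weights + config to the Hub
reloaded = MyModel.from_pretrained("your-username/your-model")  # downloads and reinstantiates it

# Or, without the mixin, download a single checkpoint file by name:
ckpt_path = hf_hub_download(repo_id="your-username/your-model", filename="model.safetensors")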
Uploading dataset
Would be awesome to make the dataset available on 🤗, so that people can do:
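from datasets import load_dataset

dataset = load_dataset("your-hf-org-or-username/your-dataset")  # placeholder repo id, to fill in once the dataset is uploaded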
See here for a guide: https://huggingface.co/docs/datasets/image_dataset
Besides that, there's the dataset viewer which allows people to quickly explore the first few rows of the data in the browser.
Let me know if you're interested/need any help regarding this!
Cheers,
Niels
ML Engineer @ HF 🤗