huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

`push_to_hub` overwrite argument #7241

Closed by ceferisbarov 4 weeks ago

ceferisbarov commented 1 month ago

Feature request

Add an overwrite argument to the push_to_hub method.

Motivation

I want to overwrite a repo without deleting it on Hugging Face. Is this possible? I couldn't find anything in the documentation or tutorials.

Your contribution

I can create a PR.

lhoestq commented 1 month ago

Hi! Do you mean deleting all the files, or erasing the repository's git history before push_to_hub?

ceferisbarov commented 1 month ago

Hi! I meant the latter.

lhoestq commented 1 month ago

I don't think there is a huggingface_hub utility to erase the git history. cc @Wauplin, maybe?

Wauplin commented 1 month ago

What exactly is the goal of deleting all the git history without deleting the repo?

Wauplin commented 1 month ago

You can use super_squash_history to squash all the commits into a single one, which deletes the git history. This is not exactly what you asked for, since it squashes the commits for a specific revision (for example, "all commits on main"). This means that if other branches exist, they are left unchanged. Also, if some PRs are already open on the repo, they will become unmergeable since the commits will have diverged.
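Since the squash applies to one revision at a time, covering a whole repo means repeating it per branch. A minimal sketch, assuming `api` behaves like `huggingface_hub.HfApi` (it relies on `list_repo_refs` and `super_squash_history`); the helper name `squash_all_branches` is hypothetical and not part of the library:

```python
def squash_all_branches(api, repo_id, repo_type="dataset"):
    """Squash the history of every branch in a repo, one branch at a time.

    `api` is assumed to behave like huggingface_hub.HfApi.
    Returns the names of the branches that were squashed.
    """
    refs = api.list_repo_refs(repo_id, repo_type=repo_type)
    squashed = []
    for branch in refs.branches:
        # Each call rewrites only the branch it is given; other branches
        # keep their history until their own call runs.
        api.super_squash_history(repo_id, branch=branch.name, repo_type=repo_type)
        squashed.append(branch.name)
    return squashed


# Usage (needs huggingface_hub installed and a token with write access):
#   from huggingface_hub import HfApi
#   squash_all_branches(HfApi(), "username/dataset_name")
```

Open PRs are not handled here and would still become unmergeable, as noted above.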

lhoestq commented 4 weeks ago

So the solution is:

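from huggingface_hub import HfApi

repo_id = "username/dataset_name"
ds.push_to_hub(repo_id)  # ds is your Dataset or DatasetDict
# Squash all commits on the default branch into a single one
HfApi().super_squash_history(repo_id, repo_type="dataset")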

This way you erase the previous git history and end up with a single commit containing your dataset. Still, I'd be curious why it's important in your case. Is it to save storage space, or to prevent loading old versions of the data?

ceferisbarov commented 4 weeks ago

Thanks, everyone! I am building a new dataset and playing around with column names, splits, etc. Sometimes I push to the Hub to share it with teammates, and I don't want those variations to be part of the repo. Deleting the repo from the website takes a little time, and it also discards the repo settings I have configured, since I always set it to public with manually approved access requests.

BTW, the call that worked for me is HfApi().super_squash_history(repo_id, repo_type="dataset").
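The push-then-squash workflow from this thread could be wrapped in one helper. This is only a sketch: the function name `push_and_squash` and the commit message are made up for illustration, `ds` is assumed to expose `push_to_hub` (as `Dataset` and `DatasetDict` do), and `api` is assumed to behave like `huggingface_hub.HfApi`:

```python
def push_and_squash(ds, repo_id, api, branch="main"):
    """Push `ds` to the Hub, then collapse `branch` into a single commit.

    Assumes `ds.push_to_hub` targets the same branch that is squashed
    (the default branch unless a revision is passed explicitly).
    """
    ds.push_to_hub(repo_id)
    api.super_squash_history(
        repo_id,
        branch=branch,
        commit_message="Squash dataset history",  # illustrative message
        repo_type="dataset",  # required for dataset repos, as noted above
    )


# Usage (needs datasets + huggingface_hub and a token with write access):
#   from huggingface_hub import HfApi
#   push_and_squash(ds, "username/dataset_name", HfApi())
```

Passing the API client in as an argument keeps the helper easy to try out against a stub before pointing it at a real repo.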

Wauplin commented 4 weeks ago

@ceferisbarov just to let you know, recreating a gated repo + granting access to your teammates is something that you can automate with something like this (not fully tested but should work):

from huggingface_hub import HfApi

repo_id = "username/dataset_name"
api = HfApi()
# Delete the repo if it exists, then recreate it as public with manual gating
api.delete_repo(repo_id, repo_type="dataset", missing_ok=True)
api.create_repo(repo_id, repo_type="dataset", private=False)
api.update_repo_settings(repo_id, repo_type="dataset", gated="manual")
for user in ["user1", "user2"]:  # list of teammates
    api.grant_access(repo_id, user, repo_type="dataset")

I think it'd be a better solution than squashing commits (which is more of a hack), especially if you are using the dataset viewer.

ceferisbarov commented 4 weeks ago

This is great, @Wauplin. If we can achieve this with HfApi, then we probably don't need to add another parameter to push_to_hub. I am closing the issue.