Closed ceferisbarov closed 4 weeks ago
Hi! Do you mean deleting all the files, or erasing the repository's git history before push_to_hub?
Hi! I meant the latter.
I don't think there is a huggingface_hub utility to erase the git history, cc @Wauplin maybe?
What is the goal exactly of deleting all the git history without deleting the repo?
You can use super_squash_history to squash all the commits into a single one, which deletes the git history. This is not exactly what you asked for, since it squashes the commits for a specific revision (for example, all commits on main). If other branches exist, they are left unchanged. Also, if some PRs are already open on the repo, they will become unmergeable since the commits will have diverged.
So the solution is:
from huggingface_hub import HfApi
repo_id = "username/dataset_name"
# ds is the datasets.Dataset (or DatasetDict) you have already built
ds.push_to_hub(repo_id)
HfApi().super_squash_history(repo_id, repo_type="dataset")
This way you erase the previous git history and end up with a single commit containing your dataset. Still, I'd be curious why it's important in your case. Is it to save storage space, or to prevent loading old versions of the data?
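To check that the squash did what you expect on a given branch, a small sketch like the one below can compare the commit count before and after. The helper name squash_and_count is hypothetical (not part of huggingface_hub), and api is assumed to behave like an HfApi instance exposing list_repo_commits and super_squash_history:

```python
def squash_and_count(api, repo_id, branch="main", repo_type="dataset"):
    """Squash `branch` into a single commit and report the commit counts.

    `api` is assumed to behave like huggingface_hub.HfApi.
    Returns a (commits_before, commits_after) tuple.
    """
    before = len(api.list_repo_commits(repo_id, repo_type=repo_type, revision=branch))
    # Only the given branch is squashed; other branches keep their history.
    api.super_squash_history(repo_id, branch=branch, repo_type=repo_type)
    after = len(api.list_repo_commits(repo_id, repo_type=repo_type, revision=branch))
    return before, after


# Usage (needs huggingface_hub installed and write access to the repo):
# from huggingface_hub import HfApi
# print(squash_and_count(HfApi(), "username/dataset_name"))
```

Passing api in as an argument keeps the helper easy to try against a stub before running it on a real repo.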
Thanks, everyone! I am building a new dataset and playing around with column names, splits, etc. Sometimes I push to the hub to share it with teammates, and I don't want those variations to be part of the repo. Deleting the repo from the website takes a little time, and it also loses the repo settings I have configured, since I always set it to public with manually approved access requests.
BTW, I had to write HfApi().super_squash_history(repo_id, repo_type="dataset"), but otherwise it works.
@ceferisbarov just to let you know, recreating a gated repo + granting access to your teammates is something that you can automate with something like this (not fully tested but should work):
from huggingface_hub import HfApi
repo_id = "username/dataset_name"
api = HfApi()
# Start from a fresh repo, then restore visibility and gating
api.delete_repo(repo_id, repo_type="dataset", missing_ok=True)
api.create_repo(repo_id, repo_type="dataset", private=False)
api.update_repo_settings(repo_id, repo_type="dataset", gated="manual")
# Re-grant access to each teammate
for user in ["user1", "user2"]:  # list of teammates
    api.grant_access(repo_id, user, repo_type="dataset")
I think it'd be a better solution than squashing commits (which is more of a hack), especially if you are using the dataset viewer.
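If you do this often, the steps above could be wrapped together with the push itself. This is only a sketch: recreate_and_push is a hypothetical helper name, api is assumed to behave like HfApi, and ds is assumed to be any object with a push_to_hub(repo_id) method, such as a datasets.Dataset:

```python
def recreate_and_push(api, ds, repo_id, teammates, repo_type="dataset"):
    """Recreate a public, manually gated repo, re-grant access, then push `ds`.

    `api` is assumed to behave like huggingface_hub.HfApi and `ds` like a
    datasets.Dataset / DatasetDict.
    """
    # Drop the old repo (and its history); ignore the case where it is missing.
    api.delete_repo(repo_id, repo_type=repo_type, missing_ok=True)
    api.create_repo(repo_id, repo_type=repo_type, private=False)
    api.update_repo_settings(repo_id, repo_type=repo_type, gated="manual")
    for user in teammates:
        api.grant_access(repo_id, user, repo_type=repo_type)
    # The fresh repo now receives the dataset as its only content.
    ds.push_to_hub(repo_id)


# Usage (needs huggingface_hub installed and write access):
# from huggingface_hub import HfApi
# recreate_and_push(HfApi(), ds, "username/dataset_name", ["user1", "user2"])
```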
This is great, @Wauplin. If we can achieve this with HfApi, then we probably don't need to add another parameter to push_to_hub. I am closing the issue.
Feature request
Add an overwrite argument to the push_to_hub method.
Motivation
I want to overwrite a repo without deleting it on Hugging Face. Is this possible? I couldn't find anything in the documentation or tutorials.
Your contribution
I can create a PR.