huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Feature request: overwrite local dir when using Repository #260

Closed NielsRogge closed 1 year ago

NielsRogge commented 3 years ago

Lately, I've been using huggingface_hub to upload BEiT (a new model) checkpoints to the hub. I used the following code:

!sudo apt-get install git-lfs
!git config --global user.email "..."
!git config --global user.name "..."
!huggingface-cli login

from huggingface_hub import HfFolder, HfApi, Repository

# first, create repo on the hub
folder = HfFolder()
token = folder.get_token()

api = HfApi()
repo_url = api.create_repo(token=token, name="beit-large-patch16-512", organization="microsoft")
print("Created repo. Can be accessed on:", repo_url)

# next, create wrapper around remote repo
repo = Repository(local_dir="checkpoint", # note that this directory must not exist already
                  clone_from=repo_url,
                  git_user="...",
                  git_email="...",
                  use_auth_token=True,
)

# next, move model files to the "checkpoint" directory and upload to hub
(...)

With the current implementation, local_dir must not already exist when instantiating a Repository. It would be useful if an existing local directory were simply overwritten instead: I had to upload several checkpoints, and I had to remove that local directory each time before uploading a new one.
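As a stopgap, the manual cleanup step described above could be wrapped in a small function. This is only a minimal sketch: `remove_local_dir` is a hypothetical helper and not part of huggingface_hub, and it irreversibly deletes everything under the given folder.

```python
import shutil
from pathlib import Path


def remove_local_dir(local_dir: str) -> bool:
    """Delete local_dir and everything in it, if it is a directory.

    Hypothetical helper, not part of huggingface_hub. Returns True if a
    directory was removed and False if there was nothing to remove.
    """
    path = Path(local_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False
```

Calling `remove_local_dir("checkpoint")` right before `Repository(local_dir="checkpoint", ...)` would reproduce what I was doing by hand for each checkpoint.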

cc @julien-c, who suggested an overwrite_local_dir flag.

julien-c commented 3 years ago

want to open a PR to do this?

NielsRogge commented 3 years ago

Ok update: it works when your local directory is an empty repository (in which case it will be overwritten by the remote git repository), but if it already contains files and is not a git repository, then you get the following error:

!mkdir checkpoint

# put a file in there
!touch checkpoint/test.txt

from huggingface_hub import Repository

repo_url = "https://huggingface.co/microsoft/beit-large-patch16-224"

repo = Repository(local_dir="checkpoint", # this directory now exists and contains a file
                  clone_from=repo_url,
                  git_user="Niels Rogge",
                  git_email="niels.rogge1@gmail.com",
                  use_auth_token=True,
)

which gives:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-15-7b379edad3e5> in <module>()
      7                   git_user="Niels Rogge",
      8                   git_email="niels.rogge1@gmail.com",
----> 9                   use_auth_token=True,
     10 )

1 frames
/usr/local/lib/python3.7/dist-packages/huggingface_hub/repository.py in clone_from(self, repo_url, use_auth_token)
    316                 if not in_repository:
    317                     raise EnvironmentError(
--> 318                         "Tried to clone a repository in a non-empty folder that isn't a git repository. If you really "
    319                         "want to do this, do it manually:\m"
    320                         "git init && git remote add origin && git pull origin main\n"

OSError: Tried to clone a repository in a non-empty folder that isn't a git repository. If you really want to do this, do it manually:\mgit init && git remote add origin && git pull origin main
 or clone repo to a new folder and move your existing files there afterwards.

The use case that I had was when I already had model files in my local directory, without it being a git repository.
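The condition behind the error can be approximated as follows. This is a simplified sketch, not the library's code: `can_clone_into` is a hypothetical name, and treating "contains a `.git` entry" as "is a git repository" is an assumption made for illustration.

```python
import os


def can_clone_into(local_dir: str) -> bool:
    """Approximate the precondition described by the error message.

    Assumption: a folder counts as a git repository if it contains a
    `.git` entry. The real check in huggingface_hub may differ.
    """
    if not os.path.isdir(local_dir):
        return True  # nothing there yet, so cloning can create the folder
    entries = os.listdir(local_dir)
    if not entries:
        return True  # an empty folder is fine
    return ".git" in entries  # non-empty: acceptable only if it is a git repo
```

With the `checkpoint` folder from the example (one file, no `.git`), this check would return False, which matches the case that raises the OSError.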

julien-c commented 3 years ago

Yes, overwrite_local_dir would rm the local directory before doing anything else, so it would fix this.

NielsRogge commented 3 years ago

No, it shouldn't remove it. It should basically join the files that are already in the local directory with the files of the remote git repository, right?

julien-c commented 3 years ago

then what happens for a filename that's both in the local dir and in the remote repo?

We had something like this w/ @LysandreJik before, but the desired behavior was unspecified so we just removed it.
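To make the ambiguity concrete, here is a rough sketch of how the colliding file names could be detected before any merge is attempted. `find_conflicts` is a hypothetical helper, and `remote_files` is assumed to be a list of repo-relative paths obtained separately; neither comes from huggingface_hub.

```python
from pathlib import Path
from typing import Iterable, List


def find_conflicts(local_dir: str, remote_files: Iterable[str]) -> List[str]:
    """Return the relative paths present both locally and in remote_files.

    Hypothetical helper: remote_files is assumed to hold repo-relative
    paths with forward slashes, e.g. "config.json" or "sub/vocab.txt".
    """
    root = Path(local_dir)
    # Collect every local file as a forward-slash path relative to local_dir.
    local_files = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    }
    return sorted(local_files.intersection(remote_files))
```

An empty result would mean the two sets of files can be joined without deciding which side wins.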

NielsRogge commented 3 years ago

> then what happens for a filename that's both in the local dir and in the remote repo?

Yeah, that's indeed a good question, perhaps we can leave it like that in that case.

Edit: perhaps we can only allow it in case the remote repository has just been created (i.e. is an empty git repository).

osanseviero commented 3 years ago

I'm a bit confused by the use case

> However, it would be useful if it just overwrites the local directory, in case it already exists, because I had to upload several checkpoints, and I had to remove that local directory each time I wanted to upload a new checkpoint.

Is your idea that files should be joined or that they should be overwritten? (or joined and just overwritten when same filename?)

NielsRogge commented 3 years ago

So my use case was the following:

1) I had some local files in a directory (pytorch_model.bin, config.json, vocab.txt). This directory was not a git repository, just a local directory.
2) I created a remote repository using api.create_repo. My goal was to upload my local files to that remote repository.
3) When I then use Repository, with local_dir being equal to my local directory of step 1, I get the error specified above. Ideally, I could just push those files to the remote repository in a subsequent step with repo.push_to_hub().

=> so perhaps we can only allow it in case the remote repository is empty.

osanseviero commented 3 years ago

I think the error suggestion should be good for this use case:

git init && git remote add origin https://huggingface.co/microsoft/beit-large-patch16-224 && git pull origin main

Which will give you a local directory with your local files + the new files from the repo.

And then you should be able to follow up by pushing. But maybe we could indeed make this simpler.
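For anyone who wants to script the manual steps, one possible sketch is below. The function names are hypothetical, it only shells out to the same three git commands, and it assumes git is installed and the default branch is main.

```python
import subprocess
from typing import List


def manual_merge_commands(repo_url: str, branch: str = "main") -> List[List[str]]:
    """Build the three git commands from the error message as argv lists."""
    return [
        ["git", "init"],
        ["git", "remote", "add", "origin", repo_url],
        ["git", "pull", "origin", branch],
    ]


def run_manual_merge(local_dir: str, repo_url: str) -> None:
    """Run the commands inside local_dir, stopping at the first failure."""
    for cmd in manual_merge_commands(repo_url):
        subprocess.run(cmd, cwd=local_dir, check=True)
```

If a local untracked file has the same name as a file in the remote repo, git should refuse the pull rather than overwrite it, so the conflict question raised earlier still has to be resolved by hand.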

Wauplin commented 1 year ago

(closing as "wontfix" as Repository usage is deprecated anyway)