huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0
2.02k stars 531 forks source link

fix/issue 2090 : Add a `repo_files` command, with recursive deletion. #2280

Closed OlivierKessler01 closed 4 months ago

OlivierKessler01 commented 4 months ago

Implement a new repo-files command to manipulate files (deletion...)

Fixes https://github.com/huggingface/huggingface_hub/issues/2235

Opiniated choice made regarding globs matching

Crux

I've encoutered an issue when implementing the folder matching as per this issue specification :

huggingface-cli repo-files <repo_id> delete folder/

To implement the pattern lexer I reused huggingface_hub.hf_api._prepare_folder_deletions(), which is already used in the upload command to lexe/tokenize allow and disallow lists.

The function is a wrapper around fnmatch.fnmatch() (via huggingface_hub.utils._paths.filter_repo_objects()). It matches files against unix shell-style file matching patterns.

Unix shell-style automatons respond to this case in an rather unconvenient way :

>>> fnmatch.fnmatch("nested/file.bin", "nested/")
False

Solution implemented

It adds "*" at the end of the patterns that end with "/". Given git untracks any folder that is empty, deleting all files in the folder effectively deletes the folder. Did it in the new huggingface_hub.hf_api.delete_files_r() function, because updating huggingface_hub.utils._paths.filter_repo_objects() at a lower-level would alter the behavior of various functions like the upload command and the commit scheduler in a non-BC compatible way.

What do you think about it ?

huggingface-cli upload --delete="nested/"
#the line above still doesn't work, but it's consistent (logic unchanged, interface unchanged). 

huggingface-cli repo-files delete nested/ #interpreted as `huggingface-cli repo-files delete nested/*`
#the line above works as per https://github.com/huggingface/huggingface_hub/issues/2235 specs.

All this has been documented in the function doc.

What could have been done instead

Remove support for this huggingface-cli repo-files <repo_id> delete folder/

Natively support this through already existing logic, and let Git untrack the folder. huggingface-cli repo-files <repo_id> delete folder/*

Tests added (duration that might disrupt DevX and project output)

20.46s call     tests/test_hf_api.py::CommitApiTest::test_delete_files_r

Remarks

Feel free to tell me if you'd like something different here :)

HuggingFaceDocBuilderDev commented 4 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

OlivierKessler01 commented 4 months ago

@Wauplin just ran Ruff, sorry 'bout this.

OlivierKessler01 commented 4 months ago

Hi @Wauplin thank you for the contribution. I just added the documentation block, have a look :)

Wauplin commented 4 months ago

And, merged!