Open stas00 opened 10 months ago
Thanks for the great feedback @stas00! I haven't looked into this for a long time but all the suggestions seem to make sense. Will keep you updated when I start working on that :)
Super! The foundation is awesome, @Wauplin - just needs a bit of polish on top.
The other problem this tool doesn't take care of is cleaning downloads/ - I checked and we had 2M files there! I had to do:
sudo find /data/huggingface/datasets/downloads -type f -mtime +3 -exec rm {} \+
sudo find /data/huggingface/datasets/downloads -type d -empty -delete
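For anyone who wants to preview the cleanup before running it, the two find commands above can be approximated in Python. This is only a sketch: prune_old_files and its dry_run flag are hypothetical names made up for illustration (they are not part of huggingface_hub or datasets), the age cutoff is a plain "older than N days" check rather than find's exact -mtime rounding, and the root directory itself is never removed.

```python
import os
import time
from pathlib import Path


def prune_old_files(root, max_age_days=3, dry_run=True, now=None):
    """Remove regular files older than max_age_days, then the dirs left empty.

    Hypothetical helper for illustration only.
    Returns the paths that were removed (or, in dry-run mode, the files
    that would be removed; directories are only pruned on a real run).
    """
    root = Path(root)
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []

    # Walk bottom-up so a directory is visited after its children,
    # which lets us prune directories emptied by the file removal.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = Path(dirpath) / name
            # Mirror `-type f`: skip symlinks and anything that is not a file.
            if path.is_symlink() or not path.is_file():
                continue
            if path.stat().st_mtime < cutoff:
                removed.append(path)
                if not dry_run:
                    path.unlink()
        current = Path(dirpath)
        if not dry_run and current != root and not any(current.iterdir()):
            current.rmdir()
            removed.append(current)
    return removed
```

Calling prune_old_files("/data/huggingface/datasets/downloads") lists the candidates without touching anything; passing dry_run=False performs the deletion.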
I'm not sure if huggingface-cli could take care of datasets as well, since they come from the hub. Please let me know if I should open a separate issue, since it's related but not the same.
@stas00 datasets cache is another topic! At the moment datasets doesn't use the default cache shared between huggingface_hub, transformers, diffusers, etc. We will have to fix that first before providing a CLI to clean this cache too (cc @lhoestq for visibility).
Do you want me to open an Issue here or on the datasets repo?
Better for the datasets repo but please tag me. Thanks!
Hey @Wauplin, thanks a lot for pointing to another issue I could work on! I have not been able to fully get into it yet, but I think I can help out here and would love to work on this! As soon as I have worked something out I will open a draft PR (or ask some clarifying questions 😄). Will probably be next week.
Hey @Wauplin, I finally found some time to look into this (later than expected, sorry about this). Not sure if I grasped this all correctly, so I'll try to lay out the way forward to avoid confusion.
Suggestion for moving forward from here:
> huggingface-cli cache delete-repo repo_id
> huggingface-cli cache delete-repo repo_id --repo-type=dataset
> huggingface-cli cache delete-repo repo_id --include (glob)
> huggingface-cli cache delete-repo repo_id --exclude (glob)
> huggingface-cli cache delete-repo repo_id --files=file1,file2,file3,file4
Is this correct? If yes, I would start by implementing the new subcommand structure and the options in separate PRs.
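To make the proposed structure concrete, here is a minimal argparse sketch of how a nested cache delete-repo subcommand with those options could be parsed. Everything in it is an assumption for discussion: the subcommand does not exist in huggingface-cli as of this thread, build_parser is a hypothetical name, and the choice of repeatable --include/--exclude flags and a comma-separated --files value is just one possible reading of the commands above.

```python
import argparse


def build_parser():
    """Hypothetical parser for the proposed `cache delete-repo` subcommand."""
    parser = argparse.ArgumentParser(prog="huggingface-cli")
    commands = parser.add_subparsers(dest="command", required=True)

    # `huggingface-cli cache ...` groups all cache-related subcommands.
    cache = commands.add_parser("cache", help="Manage the local cache")
    cache_commands = cache.add_subparsers(dest="cache_command", required=True)

    # `huggingface-cli cache delete-repo repo_id [options]`
    delete_repo = cache_commands.add_parser("delete-repo")
    delete_repo.add_argument("repo_id")
    delete_repo.add_argument(
        "--repo-type", choices=["model", "dataset", "space"], default="model"
    )
    # Assumed to be repeatable glob patterns.
    delete_repo.add_argument("--include", action="append", default=[], metavar="GLOB")
    delete_repo.add_argument("--exclude", action="append", default=[], metavar="GLOB")
    # Assumed to be a comma-separated list of file names.
    delete_repo.add_argument(
        "--files",
        type=lambda value: [name for name in value.split(",") if name],
        default=[],
    )
    return parser
```

For example, build_parser().parse_args(["cache", "delete-repo", "repo_id", "--repo-type=dataset"]) yields a namespace with repo_type set to "dataset" and empty include, exclude and files lists.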
Hi @lappemic, thanks for getting back to me. I don't think 1. and 5. should be tackled as part of this issue but rather in https://github.com/huggingface/huggingface_hub/pull/2221 directly. I find them a bit unrelated to 2. 3. and 4. which are focused on improving the current workflow.
I think that 2. and 4. can be tackled as part of the same PR while 3. (which adds a new option) can be done in a separate one. What do you think?
Thanks a lot for the feedback! Sounds good to me! Will open a draft PR next week as soon as I get to it!
I tried huggingface-cli delete-cache --disable-tui for the first time. Great intention, very problematic usage when one has thousands of hub objects to clean up. Once I understood its quirks I was able to hack around those problems.
The doc says to hit y, but the program exits if y is hit, and all the careful manual editing is lost and the user has to start from scratch (ouch!)
I suspect this is a bug or a problem in the workflow.
Please don't use a true temp file; use a file that won't get deleted, so the user can reuse it should they hit the wrong button - see Issue 1 above as an example.
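One way this could work, sketched in Python under loud assumptions: open_selection_file and finish_selection are hypothetical helpers (nothing like them is confirmed to exist in huggingface_hub), and the file location is left to the caller. The idea is that the editable file is reused if it already exists and is only removed once the deletion has actually been confirmed.

```python
from pathlib import Path


def open_selection_file(path, render_default):
    """Return (text, reused) for the editable selection file.

    Hypothetical helper: if a previous session left the file behind,
    its edited content is reused instead of being regenerated.
    """
    path = Path(path)
    if path.exists():
        return path.read_text(encoding="utf-8"), True
    text = render_default()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text, False


def finish_selection(path, confirmed):
    """Drop the selection file only after the user confirmed the deletion."""
    if confirmed:
        Path(path).unlink(missing_ok=True)
```

With this shape, hitting the wrong key only aborts the current run; the manual edits stay on disk and are picked up on the next attempt.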
Sorting to have main consistently last would help. E.g. the first few entries had a consistent main-last listing as in:
so I started to manually uncomment lines thinking main is always last, w/o paying close attention, but luckily I caught that this was inconsistent, as then I ran into:
and many other variations. For those who need to edit hundreds of these, it'd be great to have main first or last - probably first would actually be the easiest.
I tried to do it manually and it was super slow, and I was concerned my edits would get lost again if I hit Y instead of the confusing N (see Issue 1).
At the end I resorted to this hack:
and hit N, Y, Y, so I wiped out hundreds of old revisions in a second w/o manual editing.
This is usually what users want - keep the main, get rid of old revs - would it be possible to create such an option?
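Until such an option exists, the same "keep main, drop the rest" cleanup can probably be done from Python with scan_cache_dir() and delete_revisions() from huggingface_hub. The sketch below is not the CLI's implementation: sort_main_first, select_stale_revisions and delete_stale_revisions are hypothetical helper names, and treating only main as worth keeping is an assumption (revisions pointed to by tags or PR refs would be deleted too unless added to keep_refs).

```python
def sort_main_first(revisions):
    """Order (commit_hash, refs) pairs so revisions on main come first."""
    return sorted(revisions, key=lambda rev: ("main" not in rev[1], rev[0]))


def select_stale_revisions(revisions, keep_refs=("main",)):
    """Return commit hashes of revisions that none of keep_refs points to."""
    keep = set(keep_refs)
    return [commit for commit, refs in revisions if not keep & set(refs)]


def delete_stale_revisions(dry_run=True):
    """Delete every cached revision not referenced by main (hypothetical helper)."""
    # Imported here so the pure helpers above work without huggingface_hub.
    from huggingface_hub import scan_cache_dir

    cache_info = scan_cache_dir()
    stale = []
    for repo in cache_info.repos:
        pairs = [(rev.commit_hash, rev.refs) for rev in repo.revisions]
        stale.extend(select_stale_revisions(pairs))
    if not stale:
        return []

    strategy = cache_info.delete_revisions(*stale)
    print(f"Expected to free {strategy.expected_freed_size_str}")
    if not dry_run:
        strategy.execute()
    return stale
```

Running delete_stale_revisions() first with the default dry_run=True only reports how much space would be freed; nothing is removed until it is called with dry_run=False.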
Thank you!