huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0
2.12k stars 556 forks source link

`huggingface-cli delete-cache --disable-tui` improvements #1997

Open stas00 opened 10 months ago

stas00 commented 10 months ago

I tried huggingface-cli delete-cache --disable-tui for the first time. Great intention, very problematic usage when one had thousands of hub objects to cleanup. Once I understood its quirks I was able to hack around those problems.

  1. Confusing UX
huggingface-cli delete-cache --disable-tui
TUI is disabled. In order to select which revisions you want to delete, please edit
the following file using the text editor of your choice. Instructions for manual
editing are located at the beginning of the file. Edit the file, save it and confirm
to continue.
File to edit: /tmp/tmpundr7lky.txt
0 revisions selected counting for 0.0. Continue ? (y/N)

The doc says to hit y but the program exits if y is hit and all careful manual editing is lost and the user has to start from scratch (ouch!)

I suspect this is a bug or a problem in the workflow.

  1. Please don't use a true temp file, use a file that won't get deleted and a user can re-use it should they hit the wrong button - see Issue 1 above as an example.

  2. sorting to have main last consistently would help. e.g. the first few entries had a consistent-main-last listing as in:

# Model mistralai/Mistral-7B-v0.1 (29.5G, used 19 hours ago)
    5e9c98b96d071dce59368012254c55b0ec6f8658 # Refs: (detached) # modified 3 months ago
    17688039ea7b7001c702797eed2ab7716a0cc3c2 # Refs: (detached) # modified 6 weeks ago
#    26bca36bde8333b5d7f72e9ed20ccda6a618af24 # Refs: main # modified 6 weeks ago

so I started to manually uncomment lines thinking main is always last, w/o paying close attention, but luckily I caught this was inconsistent as then I run into:

# Model mistralai/Mixtral-8x7B-v0.1 (93.4G, used 10 hours ago)
#    985aa055896a8f943d4a9f2572e6ea1341823841 # Refs: main # modified 5 weeks ago
    58301445dc1378584211722b7ebf8743ec4e192b # Refs: (detached) # modified 5 weeks ago

and many other variations. For those who need to edit hundreds of these, it'd be great to have main first or last - probably actually first would be the easiest.

  1. give a user a way to delete all non-main revisions in one go

I tried to do it manually and it was super slow and I was concerned my edits will get lost again if I hit Y instead of the confusing N (see Issue 1)

at the end I resorted to this hack:

cp /tmp/tmpedbz00ox.txt cache.txt
perl -pi -e 's|^#(.*detached.*)|$1|' cache.txt
cat cache.txt >>  /tmp/tmpundr7lky.txt

and hit N, Y, Y

so I wiped out hundreds of old revisions in a second w/o manual editing.

This is usually what users want - keep the main, get rid of old revs - would it be possible to create such option?

Thank you!

Wauplin commented 9 months ago

Thanks for the great feedback @stas00! I haven't looked into this for a long time but all suggestions seems to make sense. Will keep you updated when I start working on that :)

stas00 commented 9 months ago

Super! The foundation is awesome, @Wauplin - just needs a bit of polish on top.

The other problem this tool isn't take care of is cleaning downloads/ - I checked and we had 2M of files there! I had to do:

sudo find /data/huggingface/datasets/downloads -type f -mtime +3 -exec rm {} \+
sudo find /data/huggingface/datasets/downloads -type d -empty -delete

I'm not sure if huggingface-cli could take care of datasets as well, since they come from the hub. Please let me know if I should open a separate issue, since it's related but not the same.

Wauplin commented 9 months ago

@stas00 datasets cache is another topic! At the moment datasets doesn't use the default cache shared between huggingface_hub, transformers, diffusers, etc. We will have to fix that first before providing a CLI to clean this cache too (cc @lhoestq for visibility).

stas00 commented 9 months ago

Do you want me to open an Issue here or on the datasets repo?

Wauplin commented 9 months ago

Better for the datasets repo but please tag me. Thanks!

stas00 commented 9 months ago

Done: https://github.com/huggingface/datasets/issues/6614

lappemic commented 2 months ago

Hey @Wauplin, thanks a lot for pointing to another issue i could work on! I could not yet fully get into it, but i think i can help out here and would love to work on this! As soon as i was able to work out something i will open a draft PR (or some clarifying questions 😄). Will probably be next week.

lappemic commented 1 month ago

Hey @Wauplin, found finally some time to look into this (later than expected, sorry about this). Not sure if i grasped this all correctly, so i try to lay out the way forward to avoid confusion.

Suggestion for moving forward from here:

  1. Implement a new subcommand structure (from PR #2221 )
    > huggingface-cli cache delete-repo repo_id
    > huggingface-cli cache delete-repo repo_id --repo-type=dataset
    > huggingface-cli cache delete-repo repo_id --include (glob)
    > huggingface-cli cache delete-repo repo_id --exclude (glob)
    > huggingface-cli cache delete-repo repo_id --files=file1,file2,file3,file4
    • Mark the delete-cache command as deprecated in the documentation and in the CLI itself
  2. Improve the temporary file handling
    • Replace the current temporary file with a fixed file (e.g., ./tmp_hf_cleaner.txt) when TUI is disabled
    • Ensure that user edits are not lost if they accidentally choose the wrong option
  3. Enhance revision selection
    • Add an option to select all "detached" revisions at once, keeping only main or specific tags
    • Implement consistent sorting of revisions, preferably with the main revision always in the same position (first or last)
  4. Clarify the confirmation process
    • Revise the wording when confirming deletions to avoid confusion
    • Ensure that the confirmation process is consistent and intuitive
  5. Address the .DS_Store issue
    • Fix the problem where hidden files like .DS_Store prevent the complete deletion of directories

Is this correct? If yes i would start by implementing the new subcommand structure and the options in separate PR's.

Wauplin commented 3 weeks ago

Hi @lappemic, thanks for getting back to me. I don't think 1. and 5. should be tackled as part of this issue but rather in https://github.com/huggingface/huggingface_hub/pull/2221 directly. I find them a bit unrelated to 2. 3. and 4. which are focused on improving the current workflow.

I think that 2. and 4. can be tackled as part of the same PR while 3. (which adds a new option) can be done in a separate one. What do you tihnk?

lappemic commented 3 weeks ago

Thanks a lot for the feedback! Sounds good to me! Will open a draft PR next week as soon as i get to it!