iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

dvc gc remove <datafile-or-dir> #4218

Open · edwardwbarber opened this issue 4 years ago

edwardwbarber commented 4 years ago

This has already been mentioned a few times in #2325, but I wanted to draw attention again to this aspect specifically:

Since dvc pull and dvc fetch allow for granular selection of targets, it would be very helpful to be able to use dvc gc to remove those same targets from the cache once we are done with them. In my case specifically, I have a few semi-independent datasets that I would rather not keep in the cache at the same time, but that I would like to switch between for different analyses (and occasionally have both in the cache for specific tasks).
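To illustrate the desired workflow as a sketch: the dvc pull calls below are real commands, but the targeted gc invocation is hypothetical (it borrows the syntax from this issue's title and does not exist today, which is the point of this request):

dvc pull datasets/dataset_a.dvc        # real: fetch and check out a single target
# ... work with dataset_a ...
dvc gc remove datasets/dataset_a.dvc   # hypothetical: evict only this target from the cache
dvc pull datasets/dataset_b.dvc        # real: switch to the other dataset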

Jaume-JCI commented 2 years ago

This feature would be very important in my projects as well.

dberenbaum commented 2 years ago

The only real blocker to this besides prioritization is that it's dangerous since it could delete something needed elsewhere, right? Could we add the command with a stern warning that it's dangerous?

Jaume-JCI commented 2 years ago

Thanks @dberenbaum. In the meantime, could you please comment on whether the following hack would corrupt the DVC setup in any way?

Imagine I have run

dvc pull my_data_folder.dvc

This will place the downloaded data into .dvc/cache and create a set of soft links in my_data_folder (if you have configured DVC to use soft links). That is, if we list the contents of my_data_folder with

ls -l my_data_folder

we see something like:

my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/9b/e153ac29d57601fa31e370d31556a9
...

The idea is to be able to delete only specific files. By looking at the hash displayed in the ls -l output, I can directly delete the corresponding object in the DVC cache. For instance, to remove my_data_file_1.pk, I can do:

rm my_data_folder/my_data_file_1.pk
rm .dvc/cache/4f/7bc7702897bec7e0fae679e968d792

Later, if I want to download this file again, I can just run dvc pull my_data_folder.dvc again. Would that corrupt DVC? Or do I need to delete all the files linked in my_data_folder rather than just a single one?

Thanks!

dberenbaum commented 2 years ago

I think as long as you aren't editing the files in place, and are only dropping, adding, and pulling files, it should be safe. You may want to test on a small example before trying it on your real data.
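For reference, a minimal sketch of that procedure, assuming a symlink-type cache and the file names from the example above, run from the repo root:

# resolve the symlink to the cache object before removing anything
cache_obj=$(readlink -f my_data_folder/my_data_file_1.pk)
# drop the workspace link, then the cached object itself
rm my_data_folder/my_data_file_1.pk
rm "$cache_obj"
# later, restore the file from the remote
dvc pull my_data_folder.dvc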

Jaume-JCI commented 2 years ago

Thank you very much @dberenbaum.

jorgeorpinel commented 1 year ago

Currently gc works by exclusion, i.e., you specify what NOT to collect. Would it still be helpful to have an option that specifies certain target files to keep? Ideas:

$ dvc gc -w --outs data/abc.xml

Keep the current version of data/abc.xml (referenced in the workspace)

$ dvc gc -A -o *.dvc -o model.pt

Keep raw data (all outputs of all .dvc files) and a specific model file referenced in all commits (in essence, this removes all intermediate artifacts, which can always be reproduced anyway).

oadams commented 1 year ago

This would be very useful in cases where data has been pushed to the remote accidentally, before setting cache: false / push: false in dvc.yaml. Currently it's quite difficult to selectively purge anything that was pushed by mistake.
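For context, cache and push are per-output fields in dvc.yaml; a minimal sketch, with made-up stage and file names:

stages:
  train:
    cmd: python train.py
    outs:
      - scratch.tmp:
          cache: false   # never store this output in the DVC cache
      - model.pt:
          push: false    # cache locally, but never upload to the remote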

jeremyherr commented 1 year ago

Here's another use case. Our team has been asked to remove all traces of one vendor's data from our systems because our contract with them has ended. We removed all the relevant pipelines and source code, and we used dvc remove to delete any .dvc files we had in the codebase, but that vendor's data is still in the cache. There are also files tracked by dvc.lock files still in the cache. We could use gc to obliterate all old data, but for all our other data we want to preserve old versions in the cache so that we can see exactly what changed between data versions. This is useful when our data providers introduce new errors that break our pipelines.

Another thing that could be useful is to be able to garbage collect files more than one year old, for example. Or garbage collect all old versions except the current and previous one.

daavoo commented 1 year ago

Another thing that could be useful is to be able to garbage collect files more than one year old, for example. Or garbage collect all old versions except the current and previous one.

These 2 things should be possible today using:

https://dvc.org/doc/command-reference/gc#--date
https://dvc.org/doc/command-reference/gc#--rev and https://dvc.org/doc/command-reference/gc#-n
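For example (the date and count below are illustrative, and these flags require a reasonably recent DVC version):

dvc gc --date 2023-01-01   # keep only objects referenced in commits since this date
dvc gc --rev main -n 2     # keep only objects referenced in the last 2 commits of main

Adding -c/--cloud extends the collection to remote storage as well.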

jeremyherr commented 1 year ago

Thanks @daavoo , those will be useful!

asiron commented 4 days ago

Are there any plans to implement this feature? There is essentially no way to delete specific files from the cache. For example, I have a datasets repo as well as a model repo that imports a bunch of data from datasets. The model repo created a lot of files during rapid development and easily bloated the shared cache and the storage. Let's say we now have a more mature version and we would like to clean the cache and storage: