Open edwardwbarber opened 4 years ago
This feature would be very important in my projects as well.
The only real blocker to this besides prioritization is that it's dangerous since it could delete something needed elsewhere, right? Could we add the command with a stern warning that it's dangerous?
Thanks @dberenbaum. In the meantime, could you please comment on whether the following hack would corrupt the DVC setup in any way?
Imagine I have run
dvc pull my_data_folder.dvc
This will place the downloaded data into .dvc/cache, and it will create a set of soft links in my_data_folder (if you have configured DVC to use soft links), i.e., if we list the contents of my_data_folder with
ls -l my_data_folder
We see something like:
my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
...
The idea is to be able to delete only specific files. By reading the hash displayed by the ls -l command, I can directly delete the corresponding files in the DVC cache. For instance, if we want to remove my_data_file_1.pk, I can do:
rm my_data_folder/my_data_file_1.pk
rm .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
Later, if I want to download this file again, I can just run dvc pull my_data_folder.dvc again. Would that corrupt DVC? Do I need to delete all the files that are linked in my_data_folder, instead of just a single one?
Thanks!
I think as long as you aren't editing the files in place, and are only dropping and adding files and pulling them, it should be safe. You may want to test on an example to be safe before trying on your real data.
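For anyone who wants to try this on a scratch copy first, the manual steps above (read the link target, remove the link, remove the cache object) can be sketched as a small shell helper. This is only a sketch under assumptions: drop_cached_file is a hypothetical name, not a DVC command, and it assumes the workspace files are soft links into the cache as described above.

```shell
#!/bin/sh
# Sketch only: remove one soft-linked workspace file and the cache object
# it points to. drop_cached_file is a hypothetical helper, not part of DVC.
drop_cached_file() {
    link=$1
    # Refuse anything that is not a soft link, so real data is never deleted.
    if [ ! -L "$link" ]; then
        echo "not a soft link: $link" >&2
        return 1
    fi
    # Read the link target (the cache object path) before removing the link.
    target=$(readlink "$link")
    case $target in
        /*) ;;                                    # absolute target: use as is
        *)  target=$(dirname "$link")/$target ;;  # relative target: resolve from the link's directory
    esac
    rm -- "$link"
    # -f: do not fail if the cache object is already gone or is read-only.
    rm -f -- "$target"
}
```

Usage would be something like drop_cached_file my_data_folder/my_data_file_1.pk. Note that if several files share one hash (as in the listing above), removing the cache object leaves the other links dangling until the data is pulled again.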
Thank you very much @dberenbaum.
Currently gc works by blacklisting, i.e. you specify what NOT to collect. Would it still be helpful as an option to specify certain target files to keep? Ideas:
$ dvc gc -w --outs data/abc.xml
Keep the current version of data/abc.xml (referenced in the workspace)
$ dvc gc -A -o *.dvc -o model.pt
Keep raw data (all outputs of all .dvc files) and a specific model file referenced in all commits (in essence this removes all intermediate artifacts which can always be reproduced anyway).
This would be very useful in cases where data has been pushed to the remote accidentally before setting cache: false / push: false in dvc.yaml. Currently it's quite difficult to selectively purge stuff that was accidentally pushed to the remote.
Here's another use case. Our team has been asked to remove all traces of one vendor's data from our systems, because our contract with them has ended. We removed all the relevant pipelines and source code, and we used dvc remove to remove any .dvc files we had from the codebase, but that vendor's data is still in the cache. Also, there are files tracked by dvc.lock files still in the cache. We could use gc to obliterate all old data, but for all our other data, we want to preserve old versions in the cache so that we can see exactly what changes were made between data versions. This can be useful when our data providers introduce new errors that break our pipelines.
Another thing that could be useful is to be able to garbage collect files more than one year old, for example. Or garbage collect all old versions except the current and previous one.
These 2 things should be possible today using:
https://dvc.org/doc/command-reference/gc#--date https://dvc.org/doc/command-reference/gc#--rev & https://dvc.org/doc/command-reference/gc#-n
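As a rough illustration of how those flags might be combined, here is a sketch that only prints candidate dvc gc invocations for review instead of running them. The helper names gc_keep_since and gc_keep_last are hypothetical, and the flag combinations are assumptions based on the linked docs (--date used together with --all-commits, -n counted back from --rev); check dvc gc --help for your version before running anything.

```shell
#!/bin/sh
# Sketch only: print (do not run) dvc gc invocations so they can be reviewed.
# gc_keep_since and gc_keep_last are hypothetical helper names.

# Keep data referenced by commits on or after a cutoff date (YYYY-MM-DD).
# Assumption: --date is combined with --all-commits.
gc_keep_since() {
    case $1 in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "expected YYYY-MM-DD, got: $1" >&2; return 1 ;;
    esac
    echo "dvc gc --all-commits --date $1"
}

# Keep data referenced by the last N commits, counting back from a revision
# (second argument, defaulting to HEAD).
gc_keep_last() {
    echo "dvc gc --rev ${2:-HEAD} -n $1"
}

gc_keep_since 2023-01-01
gc_keep_last 2
```

Note that -n counts commits, not versions of a particular file, so "current and previous version" only maps onto it when each commit corresponds to one data version.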
Thanks @daavoo, those will be useful!
Are there any plans to implement this feature? There is essentially no way to delete specific files from the cache. For example, I have a datasets repo as well as a model repo that imports a bunch of data from datasets. The model repo created a lot of files during rapid development and quickly bloated the shared cache and the storage. Let's say that we now have a more mature version and we would like to clean the cache and storage:
dvc import-ed targets in data/
(git push -d origin <branch>)
Currently that is not possible and it seems to me like a simple use-case.
This has already been mentioned a few times in #2325 but wanted to draw attention again to this aspect specifically:
Since dvc pull and dvc fetch allow for granular selection of targets, it would be very helpful to be able to use dvc gc to remove those same targets from cache once we are done with them. In my case specifically, I have a few semi-independent datasets I would rather avoid having to keep in cache at the same time, but would like to be able to switch between for different analyses (and occasionally have both in cache for specific tasks).