iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0
13.63k stars 1.18k forks source link

DVC Repro incorrectly saving directory to cache #4428

Closed digitalillusions closed 1 year ago

digitalillusions commented 4 years ago

Bug Report

UPDATE: Skip to https://github.com/iterative/dvc/issues/4428#issuecomment-778259232

As the title states. I run dvc repro extract for my extract pipeline stage. This takes two pretty large zip files and extracts them into specified folders. These folders should not be added to the dvc cache, since they can be easily reproduced by extracting the archives, but I declare them as dependencies so that the DAG looks nicer.

My dvc.yaml looks like this. The preprocess stage should only indicate, that the not cached folders should be used as dependencies in later stages.

stages:
  extract:
    cmd: tar -xzvf data/thingy10k/10k_tetmesh.tar.gz -C data/thingy10k/ && tar -xzvf data/thingy10k/10k_surface.tar.gz -C data/thingy10k/
    deps:
    - data/thingy10k/10k_surface.tar.gz
    - data/thingy10k/10k_tetmesh.tar.gz
    outs:
    - data/thingy10k/10k_surface:
        cache: false
    - data/thingy10k/10k_tetmesh:
        cache: false
  preprocess:
    cmd: some preprocessing command
    deps:
    - data/thingy10k/10k_tetmesh
    - data/thingy10k/10k_surface

After running dvc repro extract, the hashes of all the files are computed and then the files saved to cache. This is exactly the thing that I was trying to prevent with the cache: false option.

I confirmed that the contents of the output folders were indeed added to the cache by using du -sh .dvc/cache, which went up by exactly the size of the two folders after running the command.

Interestingly, after running dvc gc -a the cache is freed again. Also running dvc push (without first running dvc gc -a) to push the cache to my remote storage also says everything is up to date, which leads me to believe that dvc recognizes that these directories should in fact not be cached.

I have reproduced this in a local git and dvc repo by first adding the two archives using dvc add and then running the above mentioned extract stage. The archives can be downloaded from this repo https://github.com/Yixin-Hu/TetWild#dataset under Output.

Please provide information about your setup

Output of dvc version:

$ dvc version
DVC version: 1.6.0 (pip)
---------------------------------
Platform: Python 3.7.6 on Linux-5.4.0-42-generic-x86_64-with-debian-bullseye-sid
Supports: http, https, ssh, webdav, webdavs
Cache types: hardlink, symlink
Repo: dvc, git

Additional Information (if any):

If applicable, please also provide a --verbose output of the command, eg: dvc add --verbose.

efiop commented 4 years ago

Hi @digitalillusions ! Thank you for reporting this issue! This is a result of our run-cache feature, that started caching even unchaged outputs to be able to restore stages when needed. We definitely had small git-tracked files in mind, but clearly it is not the case here. We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache. We already have dvc run --no-run-cache that tells dvc not to use existing run-cache, but it still tries to save the results of this run :slightly_frowning_face:

digitalillusions commented 4 years ago

Thanks @efiop for the response! Ah thats right, I remember reading about this in the 1.0 update. I also never noticed before, because I set up my pipelines before 1.0 came out and only now made a clean clone of my repo somewhere else and used the repro functionality to reproduce the state.

We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache.

I think that would be a good idea. Though to me it would still be somewhat confusing having to, e.g. set both cache: false, run-cache: false in my dvc.yaml. Or would you specify the run-cache: false for the entire pipeline stage? Or only specify run-cache: false which would imply cache: false?

To me it was also just super unintuitive and frustrating to see 12Gb of data being saved to .dvc/cache, when I obviously set the cache flag to false.

efiop commented 4 years ago

@digitalillusions Yeah, I meant run_cache: False(or something like that) for the whole stage, rather than for every output. That might be useful for other applications as well.

digitalillusions commented 4 years ago

But would you then in addition have to specify cache: false for the individual outputs? Other than that seems like a nice solution to me.

efiop commented 4 years ago

@digitalillusions Correct, we would. The reason is that cache: false is solely for data management purposes, meaning that it won't be affected by dvc checkout/pull/push/etc. So in terms of a dvc run it would look something like dvc run --no-save-run-cache -O my_uncached_out .... Don't like the --no-save-run-cache naming though, will need to figure something out.

shcheklein commented 4 years ago

The reason is that cache: false is solely for data management purposes

Yeah, cache: became a bit misleading name now :( No ideas, how to rename or how to name new options yet though ...

digitalillusions commented 4 years ago

Would it be an option to provide keyword arguments? So instead of having --no-run-cache, you could have --run-cache (ignore | disable) where ignore would be the old --no-run-cache and disable would be to completely disable the run cache for that specific stage.

efiop commented 4 years ago

@digitalillusions Great idea! Yeah, could think about using something like that. :+1:

pared commented 3 years ago

I think we should bump the priority on this issue. It seems like no-cache effectively does not exist since 1.0:

#!/bin/bash

rm -rf repo
mkdir repo
pushd repo

git init --quiet
dvc init --quiet

echo hello >> hello
dvc add hello
git add -A
git commit -am "init"

dvc run -n run -d hello -O hello.txt --no-run-cache "cat hello >> hello.txt && echo 'modification' >> hello.txt"

Results in saving hello.txt to cache.

jimmytudeski commented 3 years ago

Hello :) I have the same kind of problem: A preprocessing stage that takes a big folder as input and produces a big folder as output. It would be great if there was a easy and fine grained way of specifying where the output of a stage ends up being. As far as I can see there are the following options for an output of stage:

I understand that the run-cache is useful. So what about the following:

Would this be possible? Thanks

efiop commented 3 years ago

@jimmytudeski Unfortunately there is no LRU functionality in our cache yet, so we would need to figure it out first. A faster, and arguably more proper fix for this is to pass run_cache argument to stage.save and if not run_cache not call stage_cache.save https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L439 . Stage.run already receives it https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L488 , but only passes it to run_stage, but it should also pass it to stage.save(). That will make dvc run --no-run-cache and dvc repro --no-run-cache not save the run cache for particular stages. Though granularity still requires run_cache: true/false per-output, as described in my other comment above :(

efiop commented 3 years ago

Had a very good discussion with @dberenbaum yesterday, and there are a few clear points:

1) the run-cache name adds to the confusion with our regular local cache. If there are any good alternative names, we'll be happy to consider them. IIRC initially we called it stage cache, but it was before we've acknowledged the "stage" terminology before 1.0 release.

2) we def need stage-level flag in dvc.yaml to tell dvc not to save or restore from run-cache. E.g.

stages:
  ...
  - mystage:
      cmd: mycmd
      run_cache: false

this way the changes will be permanent.

3) the cli flags could be improved. E.g.

Overall, we could definitely introduce run_cache: true/false for stages in dvc.yaml right now, even without corresponding CLI flag, so people could start using it. But in CLI, we could remove --no-run-cache flag in both run/repro, make --force in run/repro ignore run-cache and introduce --no-run-cache to dvc stage add, that will set run_cache: true/false flag.

CC @dberenbaum @jorgeorpinel ^

digitalillusions commented 3 years ago

the run-cache name adds to the confusion with our regular local cache. If there are any good alternative names, we'll be happy to consider them. IIRC initially we called it stage cache, but it was before we've acknowledged the "stage" terminology before 1.0 release.

Maybe a good name would be something more verbose? Since iirc the run_cache caches versions of the output without commiting them, a name like output_cache, cache_output or cache_output_history would be suitable?

Overall, we could definitely introduce run_cache: true/false for stages in dvc.yaml right now, even without corresponding CLI flag, so people could start using it. But in CLI, we could remove --no-run-cache flag in both run/repro, make --force in run/repro ignore run-cache and introduce --no-run-cache to dvc stage add, that will set run_cache: true/false flag.

So if i understand correctly, in essence the config.yaml would gain the run_cache in addition to the "normal" cache flag, which would indicate whether to save the outputs to run cache even if they are not saved to normal cache? To me this would make sense and be fine, though I would almost suggest making the run cache opt in instead of opt out. So if cache: False then run_cache: False would mimic this. For more granular control you could then specify run_cache: True if you were to want the run cache. Or requiring the user to do more work, you could always require specifying both cache and run_cache in the config.yaml.

Sorry if this comment is a bit confusing, just trying to throw some ideas around. Also when I first encountered the bug I was completely confused by the default behavior of dvc saving my 10GB tar file to cache even though I obviously specified cache: False to prevent this, which I think could happen to other new users as well.

efiop commented 3 years ago

@digitalillusions Note that run_cache: false option is stage-level, not output-level, as not saving particular outputs to run_cache is quite a bit harder to justify, as it won't provide full "restoration" that way.

The reason why we have it enabled by default for cache: false outputs is that we had metrics files in mind, which we assume people keep in git for easier navigation without dvc. We could try to be smarter about it and only allow saving cache: false to run-cache if those are metrics and not just regular outputs. This makes the defaults a bit more convoluted but seems to make sense from the high-level perspective. What do you think?

digitalillusions commented 3 years ago

@digitalillusions Note that run_cache: false option is stage-level, not output-level, as not saving particular outputs to run_cache is quite a bit harder to justify, as it won't provide full "restoration" that way.

@efiop Sure that sounds sensible to me. In any case, the user could kind of control run-cache output granularity by just having dummy stages that then save respective outputs to run-cache?

The reason why we have it enabled by default for cache: false outputs is that we had metrics files in mind, which we assume people keep in git for easier navigation without dvc. We could try to be smarter about it and only allow saving cache: false to run-cache if those are metrics and not just regular outputs. This makes the defaults a bit more convoluted but seems to make sense from the high-level perspective. What do you think?

I agree that this makes things too convoluted. If users have become accustomed to their outputs/ metrics being saved to run-cache even when cache: False so be it. Then I would suggest leaving run-cache: True when cache: False but making an exception for the metrics files seems strange to me. I just thought that opt-in to the run-cache functionality is more intuitive when disabling the basic cache.

As a different thought: How about requiring the specification of run-cache to True or False as soon as cache: False is encountered in a stage somewhere? This would require the user to make a decision, even if there is just a prompt telling them to use run-cache: True if they dont understand whats going on.

dberenbaum commented 3 years ago

Great discussion! Sorry for being so late in replying, but there's a lot to sort through here.

Current state recap

Seems like the operations of interest here are:

  1. Saving outputs to cache
  2. Restoring outputs from cache
  3. Saving run info to cache
  4. Restoring run info from cache

And currently the way these are handled in dvc repro is:

  1. By default, all 4 operations are enabled on repro.
  2. If dvc repro --no-run-cache is executed, 4 is disabled.
  3. There is no way to disable 1, 2, or 3.

Let me know if any of that is incorrect.

Thoughts/suggestions

dvc repro options

There could be circumstances where a user wants to run a pipeline one time and only do one of restore/save. In those cases, my intuition would be:

  1. dvc repro -f should disable restoring the run or outputs.
  2. dvc repro --no-commit should disable saving the run or outputs.

Neither of those seem to make sense as a consistent setup for a pipeline (there's no point in consistently performing only one of save/restore), so I think having them available only as dvc repro flags makes sense.

dvc.yaml options

  1. The cache field for outputs currently is designed with small metrics-like outputs in mind. It seems that the objective is not to avoid saving or restoring these outputs (since they are small and are currently being cached anyway), but instead to keep them tracked with git. What if there were a git or gitignore flag that determined whether those files are put into .gitignore but not whether they are cached?
  2. For a use case like a non-deterministic stage, users may want to consistently not save/restore runs, so a stage-level field like run-cache: false (or better named alternative) field in dvc.yaml makes sense to disable both saving and restoring runs.
  3. What if the stage-level field was cache just like the field for individual outputs? This would set the default cache value for all outputs under that stage and control whether the runs get cached. Then, users could still opt in/out of caching individual outputs for that stage. If the stage is cache: true and the output is cache: false, dvc could warn that the stage will not be able to be restored (or even prompt users before proceeding like suggested above).
jorgeorpinel commented 3 years ago
  1. run-cache name adds to the confusion with our regular local cache a name like output_cache, cache_output or cache_output_history would be suitable?

I've been explaining it as "log of stage runs". For ex. see https://dvc.org/doc/user-guide/project-structure/internal-files#run-cache. It seems like more of a registry or log to me (part of the project cache, but not a cache on it's own).

  1. run_cache: true/false for dvc.yaml

@efiop I'm not sure it's needed. I also think that a cache: false out should deactivate logging runs of that parent stage: No point in having the run signature and dvc.lock saved without all of the outputs cached. (I didn't get how metrics/plots change this logic, BTW.)

  1. current --no-run-cache is actually --no-restore-run-cache make--force in run/repro ignore run-cache and introduce --no-run-cache to dvc stage add

I agree that repro --force should include not restoring from cache 👍. But what about preventing repro/run from logging runs (if the stage doesn't have cache: false) ? I'd consider keeping --no-run-cahce for that case (+ make it include -f). Edit: dberenbaum's suggestion is better. Adding it to dvc stage add makes sense too 👍.

jorgeorpinel commented 3 years ago

There is no way to disable 1, 2, or 3.

@dberenbaum there's --no-commit for 1 in (as you later mention). And 4 already causes 2 🙂

circumstances where a user wants to run a pipeline one time and only do one of restore/save

Hm... you mean only restore/save missing stage outputs? But some stage outputs may invalidate others. Plain repro already does all those checks (-f is only for very special situations I think). And if you only want to restore missing outputs, you can use dvc checkout. Maybe I didn't get the idea correctly.

dvc repro --no-commit should disable saving the run or outputs.

I like that actually! No need for --no-run-cache in repro (simple=better). Unless we think users could want to cache outputs (e.g. to push them to remote storage) yet not log the run for some reason (can't think of one).

dvc.yaml options

  1. I think cache: false in dvc.yaml has a more general purpose but I'm not sure what the original intention was. Interesting idea about a git: false field, although I incline towards keeping the schema simpler if possible. But maybe rename cache to commit (as in run --no-commit) — somewhat unrelated here.
  2. Good point. I still think cache/commit :false is enough, but also see no problem in adding it to the stage level (3. whatever name is used).
dberenbaum commented 3 years ago

I've been explaining it as "log of stage runs". For ex. see https://dvc.org/doc/user-guide/project-structure/internal-files#run-cache. It seems like more of a registry or log to me (part of the project cache, but not a cache on it's own).

I like this.

@dberenbaum there's --no-commit for 1 in (as you later mention).

I would think so, but that's not how it behaves now. Try the example in https://github.com/iterative/dvc/issues/4428#issuecomment-741618372 and add --no-commit, and you will still see hello.txt saved to the cache!

  • I think cache: false in dvc.yaml has a more general purpose but I'm not sure what the original intention was. Interesting idea about a git: false field, although I incline towards keeping the schema simpler if possible. But maybe rename cache to commit (as in run --no-commit) — somewhat unrelated here.

  • Good point. I still think cache/commit :false is enough, but also see no problem in adding it to the stage level (3. whatever name is used).

You seem to be suggesting what @digitalillusions suggested (disable run-cache if any output has cache: false), which seems like a sensible default if wanting to keep the interface as simple as possible but provides a little less flexibility.

For metrics, users may want to cache the results with dvc so that they are tied to a particular dvc run, but they may also want to keep them tracked with git so that it's easy to see results without dvc (for example, browsing in Github). For use cases like the original one in this issue, users may want dvc to visualize their entire process without actually caching certain outputs, but they may also want to ignore those outputs in git since they would be way too big for git. These could both probably be accomplished by manually updating .gitignore, but that seems clunky.

Seems like there are at least 2 viable options, both of which are likely preferable to the current state. @efiop Interested in your thoughts when you have a chance.

jorgeorpinel commented 3 years ago

that's not how it behaves now

I see. But that looks like a bug which should be fixed soon (hopefully).

suggesting what digitalillusions suggested (disable run-cache if any output has cache: false

Yes. I see no point in logging a run with incomplete outputs in cache.

they may also want to keep them tracked with git so that it's easy to see results without dvc (for example, browsing in Github

OK I could see that although it's a kind of specific use case (are there user requests along this line?) We do have dvc list for this and perhaps another "bun in the oven"? 😉

dberenbaum commented 3 years ago

Good points, @jorgeorpinel , and I think disabling the run-cache if any output has cache: false is less disruptive for existing workflows than changing the syntax, so I'm fine with that suggestion as long as others don't have concerns. I'd be interested in feedback from @digitalillusions if you are still paying attention!

digitalillusions commented 3 years ago

@dberenbaum I'm still here and paying close attention! I would be very happy with a solution which disables the run-cache if cache: False, since this addresses my original issue and is is my opinion the most intuitive default behavior. I guess the --no-commit thing is a different issue?

Especially when using large datasets which can be downloaded and dont want to be kept in dvc cache, the -O option should not suddenly start saving these files to run-cache. Of course you could argue that these dont need to be part of pipeline inputs and outputs, but the dvc mechanism still works great for versioning these large datasets even if they are not present in the cache.

Would the current consensus be to add a stage level run-cache option which saves all outputs to run-cache if run-cache: True even if there are cache: False in the dvc.yaml file? The default behavior would be run-cache: False, as mentioned above, as soon as cache: False for any output? With this the choice for run cache would be all outputs or no outputs. This would also be the behavior that makes the most sense, given that the run-cache can be used to restore the entire state of a specific stage, instead of allowing the user to only save some outputs to run-cache. Did I get this right?

dberenbaum commented 3 years ago

Yes, the default would be to disable the run cache if cache: false for any output. I don't know that a stage-level run-cache is needed. If cache: true for all outputs, I don't think there's any reason not to cache the run, and if cache: false for any output, I don't think there's any reason to cache the run.

For anyone using cache: false on metrics-like outputs, they likely assume that those metrics outputs are not being cached by DVC, so I don't think there's much harm in disabling the run cache. It might be nice to give the option to have DVC cache those outputs and runs yet still have them tracked by git, but it seems like the consensus is that the added complexity of doing so may not be worthwhile.

dberenbaum commented 3 years ago

So to summarize all changes being proposed, the above comment explains the suggested behavior for the cache option in dvc.yaml.

The other suggested changes are:

  • dvc repro -f should disable restoring the run or outputs.

  • dvc repro --no-commit should disable saving the run or outputs.

dberenbaum commented 3 years ago

If cache: false is set because some data is sensitive, current behavior could be dangerous. If someone does dvc push --run-cache, for example, they will push the cache: false data to the remote.

GlaIZier commented 2 years ago

Hi and have a good day!

I've stumbled upon a weird behavior related to this issue.

Consider this dvc.yaml:

stages:
  first:
    cmd: echo "1" | tee 1.txt
    outs:
      - 1.txt:
          cache: false

In this case, everything works well: local cache is empty.

However, if I added more stages dependent on the previous ones:

stages:
  first:
    cmd: echo "1" | tee 1.txt
    outs:
      - 1.txt:
          cache: false
  second:
    cmd: echo "22" | tee 2.txt
    deps:
      - 1.txt
    outs:
      - 2.txt:
          cache: false
  third:
    cmd: echo "333" | tee 3.txt
    deps:
      - 2.txt
    outs:
      - 3.txt:
          cache: false

I got files related to the second and the third stages in the local cache along with the 'runs' folder. Why is it like this? Is a proper behavior or a bug? As far as I learned from this thread, it shouldn't be like this.

pared commented 2 years ago

Hello @GlaIZier! I am afraid that you are right, in this case --no-run-cache option added to repro should disable the run cache but it does not (as in example I posted some time ago: https://github.com/iterative/dvc/issues/4428#issuecomment-741618372) We still didn't found time to get to this bug. How severe it is to you?

dberenbaum commented 2 years ago

@pared For the first stage above, is the run cache not triggered because there are no dependencies?

dberenbaum commented 2 years ago

Unfortunately, we haven't gotten to this one for a long time. I'm lowering the priority, and hopefully we will get back to this when we start refactoring pipeline/stage code.

alkatar21 commented 1 year ago

@dberenbaum

@pared For the first stage above, is the run cache not triggered because there are no dependencies?

Yes, I noticed this while creating an example for #8582.

davide-chiuchiu commented 1 year ago

HI everybody, I just found out this behavior while trying to work on a pipeline that should handle sensitive data, and it has become a mild headache.

In my use case. I run a dvc pipeline where the first step that fetches data from a db, and the next steps run some analysis on the fetched data. The results of the analysis can be cached and then pushed to dvc remotes, but the source data should under no circumstance exit the local machine in which they have been queried. I thought that setting cache: false for the output of the fetching step was enough to prevent dvc from caching, but I keep finding them in .dvc/cache after running dvc repro. This makes me very uneasy in running dvc push as I might unintentionally be contributing to a data leak. Do you happen to have a way to handle such scenario?

PS: I thought to remove the queried data from the outputs of the fetching step, but that would make my pipeline a bunch of unconnected steps. Feasible but sub-optimal given dvc usefulness in handling data flows.

daavoo commented 1 year ago

Do you happen to have a way to handle such scenario?

Hi @davide-chiuchiu, we are discussing solutions for handling cache: fasle on a draft P.R. here: https://github.com/iterative/dvc/pull/8738 .

davide-chiuchiu commented 1 year ago

Do you happen to have a way to handle such scenario?

Hi @davide-chiuchiu, we are discussing solutions for handling cache: fasle on a draft P.R. here: #8738 .

That looks very useful. I will track it to known when we can use it. Thank you!