Closed digitalillusions closed 1 year ago
Hi @digitalillusions ! Thank you for reporting this issue! This is a result of our run-cache feature, that started caching even unchaged outputs to be able to restore stages when needed. We definitely had small git-tracked files in mind, but clearly it is not the case here. We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run
that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache. We already have dvc run --no-run-cache
that tells dvc not to use existing run-cache, but it still tries to save the results of this run :slightly_frowning_face:
Thanks @efiop for the response! Ah thats right, I remember reading about this in the 1.0 update. I also never noticed before, because I set up my pipelines before 1.0 came out and only now made a clean clone of my repo somewhere else and used the repro functionality to reproduce the state.
We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache.
I think that would be a good idea. Though to me it would still be somewhat confusing having to, e.g. set both cache: false, run-cache: false
in my dvc.yaml
. Or would you specify the run-cache: false
for the entire pipeline stage? Or only specify run-cache: false
which would imply cache: false
?
To me it was also just super unintuitive and frustrating to see 12Gb of data being saved to .dvc/cache
, when I obviously set the cache
flag to false.
@digitalillusions Yeah, I meant run_cache: False
(or something like that) for the whole stage, rather than for every output. That might be useful for other applications as well.
But would you then in addition have to specify cache: false
for the individual outputs? Other than that seems like a nice solution to me.
@digitalillusions Correct, we would. The reason is that cache: false
is solely for data management purposes, meaning that it won't be affected by dvc checkout/pull/push/etc
. So in terms of a dvc run
it would look something like dvc run --no-save-run-cache -O my_uncached_out ...
. Don't like the --no-save-run-cache
naming though, will need to figure something out.
The reason is that cache: false is solely for data management purposes
Yeah, cache:
became a bit misleading name now :( No ideas, how to rename or how to name new options yet though ...
Would it be an option to provide keyword arguments? So instead of having --no-run-cache
, you could have --run-cache (ignore | disable)
where ignore
would be the old --no-run-cache
and disable
would be to completely disable the run cache for that specific stage.
@digitalillusions Great idea! Yeah, could think about using something like that. :+1:
I think we should bump the priority on this issue. It seems like no-cache
effectively does not exist since 1.0:
#!/bin/bash
rm -rf repo
mkdir repo
pushd repo
git init --quiet
dvc init --quiet
echo hello >> hello
dvc add hello
git add -A
git commit -am "init"
dvc run -n run -d hello -O hello.txt --no-run-cache "cat hello >> hello.txt && echo 'modification' >> hello.txt"
Results in saving hello.txt
to cache.
Hello :) I have the same kind of problem: A preprocessing stage that takes a big folder as input and produces a big folder as output. It would be great if there was a easy and fine grained way of specifying where the output of a stage ends up being. As far as I can see there are the following options for an output of stage:
I understand that the run-cache is useful. So what about the following:
dvc run -o
-> dvc stores full outputdvc run -O5
-> dvc stores hashsums and the last 5 versions of the output temporary. For run 6, dvc deletes output of run 1.dvc run -O
-> same as nowWould this be possible? Thanks
@jimmytudeski Unfortunately there is no LRU functionality in our cache yet, so we would need to figure it out first.
A faster, and arguably more proper fix for this is to pass run_cache
argument to stage.save
and if not run_cache
not call stage_cache.save
https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L439 . Stage.run
already receives it https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L488 , but only passes it to run_stage
, but it should also pass it to stage.save()
. That will make dvc run --no-run-cache
and dvc repro --no-run-cache
not save the run cache for particular stages. Though granularity still requires run_cache: true/false
per-output, as described in my other comment above :(
Had a very good discussion with @dberenbaum yesterday, and there are a few clear points:
1) the run-cache
name adds to the confusion with our regular local cache. If there are any good alternative names, we'll be happy to consider them. IIRC initially we called it stage cache
, but it was before we've acknowledged the "stage" terminology before 1.0 release.
2) we def need stage-level flag in dvc.yaml to tell dvc not to save or restore from run-cache. E.g.
stages:
...
- mystage:
cmd: mycmd
run_cache: false
this way the changes will be permanent.
3) the cli flags could be improved. E.g.
--force
in run/repro should probably imply --no-run-cache
--no-run-cache
is actually --no-restore-run-cache
and might need to be renamed because of stage-level run_cache: true/false
flag.Overall, we could definitely introduce run_cache: true/false
for stages in dvc.yaml right now, even without corresponding CLI flag, so people could start using it. But in CLI, we could remove --no-run-cache
flag in both run/repro, make --force
in run/repro ignore run-cache and introduce --no-run-cache
to dvc stage add
, that will set run_cache: true/false
flag.
CC @dberenbaum @jorgeorpinel ^
the run-cache name adds to the confusion with our regular local cache. If there are any good alternative names, we'll be happy to consider them. IIRC initially we called it stage cache, but it was before we've acknowledged the "stage" terminology before 1.0 release.
Maybe a good name would be something more verbose? Since iirc the run_cache
caches versions of the output without commiting them, a name like output_cache
, cache_output
or cache_output_history
would be suitable?
Overall, we could definitely introduce run_cache: true/false for stages in dvc.yaml right now, even without corresponding CLI flag, so people could start using it. But in CLI, we could remove --no-run-cache flag in both run/repro, make --force in run/repro ignore run-cache and introduce --no-run-cache to dvc stage add, that will set run_cache: true/false flag.
So if i understand correctly, in essence the config.yaml
would gain the run_cache
in addition to the "normal" cache
flag, which would indicate whether to save the outputs to run cache even if they are not saved to normal cache? To me this would make sense and be fine, though I would almost suggest making the run cache opt in instead of opt out. So if cache: False
then run_cache: False
would mimic this. For more granular control you could then specify run_cache: True
if you were to want the run cache. Or requiring the user to do more work, you could always require specifying both cache
and run_cache
in the config.yaml.
Sorry if this comment is a bit confusing, just trying to throw some ideas around. Also when I first encountered the bug I was completely confused by the default behavior of dvc saving my 10GB tar file to cache even though I obviously specified cache: False
to prevent this, which I think could happen to other new users as well.
@digitalillusions Note that run_cache: false
option is stage-level, not output-level, as not saving particular outputs to run_cache is quite a bit harder to justify, as it won't provide full "restoration" that way.
The reason why we have it enabled by default for cache: false
outputs is that we had metrics files in mind, which we assume people keep in git for easier navigation without dvc. We could try to be smarter about it and only allow saving cache: false
to run-cache if those are metrics and not just regular outputs. This makes the defaults a bit more convoluted but seems to make sense from the high-level perspective. What do you think?
@digitalillusions Note that run_cache: false option is stage-level, not output-level, as not saving particular outputs to run_cache is quite a bit harder to justify, as it won't provide full "restoration" that way.
@efiop Sure that sounds sensible to me. In any case, the user could kind of control run-cache output granularity by just having dummy stages that then save respective outputs to run-cache?
The reason why we have it enabled by default for cache: false outputs is that we had metrics files in mind, which we assume people keep in git for easier navigation without dvc. We could try to be smarter about it and only allow saving cache: false to run-cache if those are metrics and not just regular outputs. This makes the defaults a bit more convoluted but seems to make sense from the high-level perspective. What do you think?
I agree that this makes things too convoluted. If users have become accustomed to their outputs/ metrics being saved to run-cache even when cache: False
so be it. Then I would suggest leaving run-cache: True
when cache: False
but making an exception for the metrics files seems strange to me. I just thought that opt-in to the run-cache functionality is more intuitive when disabling the basic cache.
As a different thought: How about requiring the specification of run-cache
to True or False as soon as cache: False
is encountered in a stage somewhere? This would require the user to make a decision, even if there is just a prompt telling them to use run-cache: True
if they dont understand whats going on.
Great discussion! Sorry for being so late in replying, but there's a lot to sort through here.
Seems like the operations of interest here are:
And currently the way these are handled in dvc repro is:
dvc repro --no-run-cache
is executed, 4 is disabled.Let me know if any of that is incorrect.
There could be circumstances where a user wants to run a pipeline one time and only do one of restore/save. In those cases, my intuition would be:
dvc repro -f
should disable restoring the run or outputs.dvc repro --no-commit
should disable saving the run or outputs.Neither of those seem to make sense as a consistent setup for a pipeline (there's no point in consistently performing only one of save/restore), so I think having them available only as dvc repro
flags makes sense.
cache
field for outputs currently is designed with small metrics-like outputs in mind. It seems that the objective is not to avoid saving or restoring these outputs (since they are small and are currently being cached anyway), but instead to keep them tracked with git. What if there were a git
or gitignore
flag that determined whether those files are put into .gitignore
but not whether they are cached? run-cache: false
(or better named alternative) field in dvc.yaml
makes sense to disable both saving and restoring runs.cache
just like the field for individual outputs? This would set the default cache
value for all outputs under that stage and control whether the runs get cached. Then, users could still opt in/out of caching individual outputs for that stage. If the stage is cache: true
and the output is cache: false
, dvc could warn that the stage will not be able to be restored (or even prompt users before proceeding like suggested above).
- run-cache name adds to the confusion with our regular local cache a name like output_cache, cache_output or cache_output_history would be suitable?
I've been explaining it as "log of stage runs". For ex. see https://dvc.org/doc/user-guide/project-structure/internal-files#run-cache. It seems like more of a registry or log to me (part of the project cache, but not a cache on it's own).
run_cache: true/false
for dvc.yaml
@efiop I'm not sure it's needed. I also think that a cache: false
out should deactivate logging runs of that parent stage: No point in having the run signature and dvc.lock saved without all of the outputs cached. (I didn't get how metrics/plots change this logic, BTW.)
- current
--no-run-cache
is actually--no-restore-run-cache
make--force
inrun/repro
ignore run-cache and introduce--no-run-cache
todvc stage add
I agree that repro --force
should include not restoring from cache 👍.
But what about preventing Edit: dberenbaum's suggestion is better.
Adding it to repro/run
from logging runs (if the stage doesn't have cache: false
) ? I'd consider keeping --no-run-cahce
for that case (+ make it include -f
).dvc stage add
makes sense too 👍.
There is no way to disable 1, 2, or 3.
@dberenbaum there's --no-commit
for 1 in (as you later mention). And 4 already causes 2 🙂
circumstances where a user wants to run a pipeline one time and only do one of restore/save
Hm... you mean only restore/save missing stage outputs? But some stage outputs may invalidate others. Plain repro
already does all those checks (-f
is only for very special situations I think). And if you only want to restore missing outputs, you can use dvc checkout
. Maybe I didn't get the idea correctly.
dvc repro --no-commit
should disable saving the run or outputs.
I like that actually! No need for --no-run-cache
in repro
(simple=better). Unless we think users could want to cache outputs (e.g. to push them to remote storage) yet not log the run for some reason (can't think of one).
dvc.yaml options
cache: false
in dvc.yaml has a more general purpose but I'm not sure what the original intention was. Interesting idea about a git: false
field, although I incline towards keeping the schema simpler if possible. But maybe rename cache
to commit
(as in run --no-commit
) — somewhat unrelated here.cache/commit :false
is enough, but also see no problem in adding it to the stage level (3. whatever name is used).I've been explaining it as "log of stage runs". For ex. see https://dvc.org/doc/user-guide/project-structure/internal-files#run-cache. It seems like more of a registry or log to me (part of the project cache, but not a cache on it's own).
I like this.
@dberenbaum there's
--no-commit
for 1 in (as you later mention).
I would think so, but that's not how it behaves now. Try the example in https://github.com/iterative/dvc/issues/4428#issuecomment-741618372 and add --no-commit
, and you will still see hello.txt
saved to the cache!
I think
cache: false
in dvc.yaml has a more general purpose but I'm not sure what the original intention was. Interesting idea about agit: false
field, although I incline towards keeping the schema simpler if possible. But maybe renamecache
tocommit
(as inrun --no-commit
) — somewhat unrelated here.Good point. I still think
cache/commit :false
is enough, but also see no problem in adding it to the stage level (3. whatever name is used).
You seem to be suggesting what @digitalillusions suggested (disable run-cache if any output has cache: false
), which seems like a sensible default if wanting to keep the interface as simple as possible but provides a little less flexibility.
For metrics, users may want to cache the results with dvc so that they are tied to a particular dvc run, but they may also want to keep them tracked with git so that it's easy to see results without dvc (for example, browsing in Github). For use cases like the original one in this issue, users may want dvc to visualize their entire process without actually caching certain outputs, but they may also want to ignore those outputs in git since they would be way too big for git. These could both probably be accomplished by manually updating .gitignore
, but that seems clunky.
Seems like there are at least 2 viable options, both of which are likely preferable to the current state. @efiop Interested in your thoughts when you have a chance.
that's not how it behaves now
I see. But that looks like a bug which should be fixed soon (hopefully).
suggesting what digitalillusions suggested (disable run-cache if any output has cache: false
Yes. I see no point in logging a run with incomplete outputs in cache.
they may also want to keep them tracked with git so that it's easy to see results without dvc (for example, browsing in Github
OK I could see that although it's a kind of specific use case (are there user requests along this line?) We do have dvc list
for this and perhaps another "bun in the oven"? 😉
Good points, @jorgeorpinel , and I think disabling the run-cache
if any output has cache: false
is less disruptive for existing workflows than changing the syntax, so I'm fine with that suggestion as long as others don't have concerns. I'd be interested in feedback from @digitalillusions if you are still paying attention!
@dberenbaum I'm still here and paying close attention! I would be very happy with a solution which disables the run-cache
if cache: False
, since this addresses my original issue and is is my opinion the most intuitive default behavior. I guess the --no-commit
thing is a different issue?
Especially when using large datasets which can be downloaded and dont want to be kept in dvc cache, the -O
option should not suddenly start saving these files to run-cache
. Of course you could argue that these dont need to be part of pipeline inputs and outputs, but the dvc mechanism still works great for versioning these large datasets even if they are not present in the cache.
Would the current consensus be to add a stage level run-cache
option which saves all outputs to run-cache
if run-cache: True
even if there are cache: False
in the dvc.yaml
file? The default behavior would be run-cache: False
, as mentioned above, as soon as cache: False
for any output? With this the choice for run cache would be all outputs or no outputs. This would also be the behavior that makes the most sense, given that the run-cache
can be used to restore the entire state of a specific stage, instead of allowing the user to only save some outputs to run-cache
. Did I get this right?
Yes, the default would be to disable the run cache if cache: false
for any output. I don't know that a stage-level run-cache
is needed. If cache: true
for all outputs, I don't think there's any reason not to cache the run, and if cache: false
for any output, I don't think there's any reason to cache the run.
For anyone using cache: false
on metrics-like outputs, they likely assume that those metrics outputs are not being cached by DVC, so I don't think there's much harm in disabling the run cache. It might be nice to give the option to have DVC cache those outputs and runs yet still have them tracked by git, but it seems like the consensus is that the added complexity of doing so may not be worthwhile.
So to summarize all changes being proposed, the above comment explains the suggested behavior for the cache
option in dvc.yaml
.
The other suggested changes are:
dvc repro -f
should disable restoring the run or outputs.
dvc repro --no-commit
should disable saving the run or outputs.
If cache: false
is set because some data is sensitive, current behavior could be dangerous. If someone does dvc push --run-cache
, for example, they will push the cache: false
data to the remote.
Hi and have a good day!
I've stumbled upon a weird behavior related to this issue.
Consider this dvc.yaml:
stages:
first:
cmd: echo "1" | tee 1.txt
outs:
- 1.txt:
cache: false
In this case, everything works well: local cache is empty.
However, if I added more stages dependent on the previous ones:
stages:
first:
cmd: echo "1" | tee 1.txt
outs:
- 1.txt:
cache: false
second:
cmd: echo "22" | tee 2.txt
deps:
- 1.txt
outs:
- 2.txt:
cache: false
third:
cmd: echo "333" | tee 3.txt
deps:
- 2.txt
outs:
- 3.txt:
cache: false
I got files related to the second and the third stages in the local cache along with the 'runs' folder. Why is it like this? Is a proper behavior or a bug? As far as I learned from this thread, it shouldn't be like this.
Hello @GlaIZier! I am afraid that you are right, in this case --no-run-cache
option added to repro
should disable the run cache but it does not (as in example I posted some time ago: https://github.com/iterative/dvc/issues/4428#issuecomment-741618372) We still didn't found time to get to this bug. How severe it is to you?
@pared For the first stage above, is the run cache not triggered because there are no dependencies?
Unfortunately, we haven't gotten to this one for a long time. I'm lowering the priority, and hopefully we will get back to this when we start refactoring pipeline/stage code.
@dberenbaum
@pared For the first stage above, is the run cache not triggered because there are no dependencies?
Yes, I noticed this while creating an example for #8582.
HI everybody, I just found out this behavior while trying to work on a pipeline that should handle sensitive data, and it has become a mild headache.
In my use case. I run a dvc pipeline where the first step that fetches data from a db, and the next steps run some analysis on the fetched data. The results of the analysis can be cached and then pushed to dvc remotes, but the source data should under no circumstance exit the local machine in which they have been queried. I thought that setting cache: false
for the output of the fetching step was enough to prevent dvc from caching, but I keep finding them in .dvc/cache
after running dvc repro
. This makes me very uneasy in running dvc push
as I might unintentionally be contributing to a data leak. Do you happen to have a way to handle such scenario?
PS: I thought to remove the queried data from the outputs of the fetching step, but that would make my pipeline a bunch of unconnected steps. Feasible but sub-optimal given dvc usefulness in handling data flows.
Do you happen to have a way to handle such scenario?
Hi @davide-chiuchiu, we are discussing solutions for handling cache: fasle
on a draft P.R. here: https://github.com/iterative/dvc/pull/8738 .
Do you happen to have a way to handle such scenario?
Hi @davide-chiuchiu, we are discussing solutions for handling
cache: fasle
on a draft P.R. here: #8738 .
That looks very useful. I will track it to known when we can use it. Thank you!
Bug Report
UPDATE: Skip to https://github.com/iterative/dvc/issues/4428#issuecomment-778259232
As the title states. I run
dvc repro extract
for myextract
pipeline stage. This takes two pretty large zip files and extracts them into specified folders. These folders should not be added to the dvc cache, since they can be easily reproduced by extracting the archives, but I declare them as dependencies so that the DAG looks nicer.My
dvc.yaml
looks like this. Thepreprocess
stage should only indicate, that the not cached folders should be used as dependencies in later stages.After running
dvc repro extract
, the hashes of all the files are computed and then the files saved to cache. This is exactly the thing that I was trying to prevent with thecache: false
option.I confirmed that the contents of the output folders were indeed added to the cache by using
du -sh .dvc/cache
, which went up by exactly the size of the two folders after running the command.Interestingly, after running
dvc gc -a
the cache is freed again. Also runningdvc push
(without first runningdvc gc -a
) to push the cache to my remote storage also says everything is up to date, which leads me to believe that dvc recognizes that these directories should in fact not be cached.I have reproduced this in a local git and dvc repo by first adding the two archives using
dvc add
and then running the above mentioned extract stage. The archives can be downloaded from this repo https://github.com/Yixin-Hu/TetWild#dataset underOutput
.Please provide information about your setup
Output of
dvc version
:Additional Information (if any):
If applicable, please also provide a
--verbose
output of the command, eg:dvc add --verbose
.