Open nono1515 opened 1 year ago
Thanks @nono1515! Nice description. We will take a look.
For reference and more context, here's the initial discord discussion: https://discord.com/channels/485586884165107732/1045519876430766100/1100383943724892290
When running dvc repro path/to/dvc.yaml with the --downstream argument, DVC looks for all dvc.yaml files in the workspace and also executes stages in those files if they have dependencies downstream. This is inconsistent with dvc repro path/to/dvc.yaml, which only executes stages in the given DVC pipeline.
The target path/to/dvc.yaml does not mean that it's limited to that file. It'll try to run all dependencies of stages from that file, which may have been defined elsewhere.
With --downstream, this means to continue reproducing all stages from that file and the downstream stages of those. So this is working as intended.
I understand that this is a bit confusing, as in DVC a dvc.yaml file is not a self-contained pipeline (even if we pretend it to be), and the unit of processing is a "stage", not a file.
@skshetry The bug here is not that it runs stages from other dvc.yaml paths. The bug is this inconsistency:
Running dvc repro pipelines/train/dvc.yaml executes A and B. Running dvc repro pipelines/train/dvc.yaml --downstream executes A, B and C.
Should dvc repro pipelines/train/dvc.yaml also execute C?
graph TD
A-->B
B-->C
That dvc.yaml contains stages A and B. This is the same as doing dvc repro A B, which reproduces A and B, as in it runs up to A and B including their dependencies.
C depends on B; A and B don't depend on C.
In the case of --downstream, it means continuing repro from the specified targets and below, which means A, B (the targets themselves), and then C (downstream of the targets).
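To make the two directions concrete, here is a minimal Python sketch of the A --> B --> C example. It is only a toy model of the selection rules described above, not DVC's actual implementation; the function names are made up for illustration.

```python
# Toy model of the example graph; this is NOT DVC's implementation.
# Edges point from a stage to the stages that depend on it (A --> B --> C).
EDGES = {"A": ["B"], "B": ["C"], "C": []}


def downstream(targets):
    """Targets plus every stage reachable by following the edges."""
    seen, stack = set(), list(targets)
    while stack:
        stage = stack.pop()
        if stage not in seen:
            seen.add(stage)
            stack.extend(EDGES[stage])
    return seen


def upstream(targets):
    """Targets plus every stage they depend on, directly or indirectly."""
    # Invert the edges so each stage maps to its direct dependencies.
    parents = {s: [p for p, kids in EDGES.items() if s in kids] for s in EDGES}
    seen, stack = set(), list(targets)
    while stack:
        stage = stack.pop()
        if stage not in seen:
            seen.add(stage)
            stack.extend(parents[stage])
    return seen


targets = ["A", "B"]  # the stages defined in pipelines/train/dvc.yaml
print(sorted(upstream(targets)))    # ['A', 'B']       (default behaviour)
print(sorted(downstream(targets)))  # ['A', 'B', 'C']  (--downstream)
```

In this model the default selection never reaches C because nothing in the target file depends on it, while the downstream selection picks it up regardless of which file defines it.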
@nono1515, if you want to run only the stages in the file, you can do dvc repro -s dvc.yaml.
You can reproduce the whole pipeline of all the stages in dvc.yaml using dvc repro --pipeline dvc.yaml, which will run all dependencies of the stages in the dvc.yaml file and everything that follows them. In the above case, it'll run A, B, and C.
You can use --all-pipelines to run everything, that is, every pipeline in the workspace.
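The three options above can be compared side by side with a small Python sketch. It is a toy model under stated assumptions, not DVC's code: the X --> Y pipeline is a hypothetical addition so that --all-pipelines has something extra to select, and the function names are invented for illustration.

```python
# Toy model only, not DVC's implementation.
# A --> B --> C is the thread's example; X --> Y is a hypothetical second pipeline.
EDGES = [("A", "B"), ("B", "C"), ("X", "Y")]
STAGES = {stage for edge in EDGES for stage in edge}


def single_item(targets):
    """Models `-s`: only the targets themselves."""
    return set(targets)


def whole_pipeline(targets):
    """Models `--pipeline`: every stage connected to a target, in either direction."""
    neighbours = {stage: set() for stage in STAGES}
    for parent, child in EDGES:
        neighbours[parent].add(child)
        neighbours[child].add(parent)
    seen, stack = set(), list(targets)
    while stack:
        stage = stack.pop()
        if stage not in seen:
            seen.add(stage)
            stack.extend(neighbours[stage])
    return seen


def all_pipelines():
    """Models `--all-pipelines`: every stage in the workspace."""
    return set(STAGES)


targets = ["A", "B"]  # the stages defined in the target dvc.yaml
print(sorted(single_item(targets)))     # ['A', 'B']
print(sorted(whole_pipeline(targets)))  # ['A', 'B', 'C']
print(sorted(all_pipelines()))          # ['A', 'B', 'C', 'X', 'Y']
```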
@nono1515, If you want to run only the stages in the file, you can do dvc repro -s dvc.yaml.
@skshetry How is it different from dvc repro dvc.yaml? In the example above, that also only runs the stages in the file. I think I'm confused about the default behavior of targets. Are they reproducing the pipeline associated with that target, or only the specified stages, or something else (only upstream/downstream stages)?
@skshetry How is it different from dvc repro dvc.yaml? In the example above, that also only runs the stages in the file. I think I'm confused about the default behavior of targets. Are they reproducing the pipeline associated with that target, or only the specified stages, or something else (only upstream/downstream stages)?
I guess it's running all the upstream stages by default? If so, I think we can clarify better in the docs.
@dberenbaum, the stages that are in dvc.yaml may have some dependencies defined elsewhere, in which case it will try to run them before running the stages in dvc.yaml.
I guess it's running all the upstream stages by default? If so, I think we can clarify better in the docs.
I think the confusion is with dvc.yaml as a target, not the upstream, if I understand you correctly.
Docs seem to indicate that the stages are limited to just the file if the target is a file, which is not the case.
dvc repro linear/dvc.yaml: A dvc.yaml file
We also don't really document the multi dvc.yaml file thingy, and in the docs, pipelines and dvc.yaml are synonymous.
Thanks @skshetry and sorry for my confusion here -- I see it's at most a docs issue now and doesn't seem like any behavior needs to change.
I think the confusion is with dvc.yaml as a target, not the upstream, if I understand you correctly. Docs seem to indicate that the stages are limited to just the file if the target is a file, which is not the case.
The docs do say:
Keep in mind that one dvc.yaml file does not necessarily equal one pipeline (although that is typical). So DVC reads all the dvc.yaml files in the workspace to rebuild pipeline(s).
Maybe it could be more prominent. Hard for me to say how others will interpret it, but I got confused here because once I was in a "downstream" mindset, I forgot that the default behavior is "upstream," and I don't see it explained anywhere that specifying a target means reproducing everything upstream (that is, everything up to and including the targets). In particular, the combination of different dvc.yaml files and upstream/downstream behavior is confusing, since the default may run stages outside of dvc.yaml, but only if they are upstream of targets specified in dvc.yaml.
Getting back to the original question, is there a way to run only downstream stages within a single dvc.yaml file?
Isn't that the same as dvc repro -s dvc.yaml?
Isn't that the same as dvc repro -s dvc.yaml?
AFAIU the request from discord is different. They only want to run stages in dvc.yaml that are downstream from a specific stage, but dvc.yaml may also contain upstream or unrelated stages.
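As a rough illustration of what that request would select, here is a Python sketch that intersects the downstream set of one stage with the stages defined in a single file. This models the requested behaviour only; nothing in this thread shows DVC offering it as an option. Stage U (an unrelated stage in the same file) and the path pipelines/other/dvc.yaml are hypothetical, added just for the example.

```python
# Toy model of the request; not an existing DVC option as far as this thread shows.
# Stage U and the second file path are hypothetical.
EDGES = {"A": ["B"], "B": ["C"], "C": [], "U": []}
STAGE_FILE = {
    "A": "pipelines/train/dvc.yaml",
    "B": "pipelines/train/dvc.yaml",
    "U": "pipelines/train/dvc.yaml",  # unrelated stage in the same file
    "C": "pipelines/other/dvc.yaml",  # downstream stage defined elsewhere
}


def downstream(start):
    """The start stage plus everything that depends on it, across all files."""
    seen, stack = set(), [start]
    while stack:
        stage = stack.pop()
        if stage not in seen:
            seen.add(stage)
            stack.extend(EDGES[stage])
    return seen


def downstream_in_file(start, path):
    """Downstream stages of `start`, keeping only those defined in `path`."""
    return {stage for stage in downstream(start) if STAGE_FILE[stage] == path}


print(sorted(downstream("A")))                                      # ['A', 'B', 'C']
print(sorted(downstream_in_file("A", "pipelines/train/dvc.yaml")))  # ['A', 'B']
```

Here C is dropped because it lives in another file, and U is dropped because it is not downstream of A, which matches the "downstream from a specific stage, within one dvc.yaml" description.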
I find the use case a bit odd. While I do see some use cases for --downstream (and maybe --continue-until here?), dvc is not a general-purpose task runner. dvc repro's primary task is to reproduce the given target.
Limiting to a file can give surprising results, as dvc repro runs your DAG, which may extend beyond the file where the "source" stages are defined.
Bug Report
Description
When running dvc repro path/to/dvc.yaml with the --downstream argument, DVC looks for all dvc.yaml files in the workspace and also executes stages in those files if they have dependencies downstream. This is inconsistent with dvc repro path/to/dvc.yaml, which only executes stages in the given DVC pipeline.
Reproduce
Let's say you have two pipelines with the following stages, outputs and dependencies
such that
and
Running dvc repro pipelines/train/dvc.yaml executes A and B. Running dvc repro pipelines/train/dvc.yaml --downstream executes A, B and C.
Expected
dvc repro pipelines/train/dvc.yaml --downstream should only run A and B, as C is not in the given pipeline, and to be consistent with dvc repro pipelines/train/dvc.yaml.
Environment information
Output of dvc doctor:
Additional Information (if any):