iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

repro pipelines/train/dvc.yaml --downstream: Also looks for other dvc pipelines in the repository #9381

Open nono1515 opened 1 year ago

nono1515 commented 1 year ago

Bug Report

Description

When running dvc repro path/to/dvc.yaml with the --downstream argument, DVC looks at all dvc.yaml files in the workspace and also executes stages from those files if they are downstream of the targeted stages. This is inconsistent with dvc repro path/to/dvc.yaml, which only executes stages in the given dvc pipeline.

Reproduce

Let's say you have two pipelines with the following stages, outputs and dependencies

such that

$ dvc dag pipelines/train/dvc.yaml    
+----------------------------+ 
| pipelines/train/dvc.yaml:A | 
+----------------------------+ 
               *               
               *               
               *               
+----------------------------+ 
| pipelines/train/dvc.yaml:B | 
+----------------------------+ 

and

$ dvc dag pipelines/test/dvc.yaml 
+----------------------------+ 
| pipelines/train/dvc.yaml:A | 
+----------------------------+ 
               *               
               *               
               *               
+----------------------------+ 
| pipelines/train/dvc.yaml:B | 
+----------------------------+ 
               *               
               *               
               *               
+---------------------------+  
| pipelines/test/dvc.yaml:C |  
+---------------------------+  

Running dvc repro pipelines/train/dvc.yaml executes A and B. Running dvc repro pipelines/train/dvc.yaml --downstream executes A, B and C.
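The stage definitions behind these two DAGs are not shown in the report. As a stand-in, here is a minimal sketch of one layout that would produce them. The commands and file names (a.txt, b.txt, c.txt) are invented for illustration and are not taken from the reporter's project. The only part that matters is that stage C in pipelines/test/dvc.yaml lists an output of stage B as a dependency.

```python
import tempfile
from pathlib import Path
from typing import List

# Hypothetical contents: stage names match the report, everything else is assumed.
TRAIN_DVC_YAML = """\
stages:
  A:
    cmd: echo a > a.txt
    outs:
      - a.txt
  B:
    cmd: cp a.txt b.txt
    deps:
      - a.txt
    outs:
      - b.txt
"""

# C depends on B's output, which lives in the other pipeline's directory.
TEST_DVC_YAML = """\
stages:
  C:
    cmd: cp ../train/b.txt c.txt
    deps:
      - ../train/b.txt
    outs:
      - c.txt
"""


def write_example_pipelines(root: Path) -> List[Path]:
    """Write the two dvc.yaml files under root/pipelines and return their paths."""
    written = []
    for name, content in (("train", TRAIN_DVC_YAML), ("test", TEST_DVC_YAML)):
        target = root / "pipelines" / name / "dvc.yaml"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        written.append(target)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_example_pipelines(Path(tmp)):
            print(path.relative_to(tmp))
```

Running this inside an initialized DVC repository (rather than a temporary directory) should give a workspace on which the commands above can be tried.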

Expected

dvc repro pipelines/train/dvc.yaml --downstream should only run A and B, as C is not in the given pipeline, and to be consistent with dvc repro pipelines/train/dvc.yaml

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.55.0 (pip)
-------------------------
Platform: Python 3.8.16 on Linux-5.15.108-1-MANJARO-x86_64-with-glibc2.34
Subprojects:
        dvc_data = 0.47.2
        dvc_objects = 0.21.2
        dvc_render = 0.3.1
        dvc_task = 0.2.1
        scmrepo = 1.0.2
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2023.4.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/d66f9faf43af22dff31dd5850172cab3


dberenbaum commented 1 year ago

Thanks @nono1515! Nice description. We will take a look.

For reference and more context, here's the initial discord discussion: https://discord.com/channels/485586884165107732/1045519876430766100/1100383943724892290

skshetry commented 1 year ago

When running dvc repro path/to/dvc.yaml with the --downstream argument, DVC looks at all dvc.yaml files in the workspace and also executes stages from those files if they are downstream of the targeted stages. This is inconsistent with dvc repro path/to/dvc.yaml, which only executes stages in the given dvc pipeline.

The target path/to/dvc.yaml does not mean that it's limited to that file. It'll try to run all dependencies of stages from that file, which may have been defined elsewhere.

With --downstream, this means reproducing all stages from that file and then continuing with the stages downstream of them. So this is working as intended.

I understand that this is a bit confusing: in dvc, a dvc.yaml file is not a self-contained pipeline (even if we pretend it is), and the unit of processing is a "stage", not a file.

dberenbaum commented 1 year ago

@skshetry The bug here is not that it runs stages from other dvc.yaml paths. The bug is this inconsistency:

Running dvc repro pipelines/train/dvc.yaml executes A and B. Running dvc repro pipelines/train/dvc.yaml --downstream executes A, B and C.

Should dvc repro pipelines/train/dvc.yaml also execute C?

skshetry commented 1 year ago
graph TD
  A-->B
  B-->C

That dvc.yaml contains stages A and B. This is the same as doing dvc repro A B, which reproduces A and B, as in it runs up to A and B including their dependencies. C depends on B, A and B don't depend on C.

In case of --downstream, it means continuing repro from the specified targets and below, which means A, B (the targets themselves), and then C (downstream of the targets).
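To make the two directions concrete, here is a toy model of the explanation above. It is not DVC's implementation: the DEPS dictionary and both function names are assumptions made for this sketch. The default mode follows dependency edges backwards from the targets, while --downstream follows the same edges forwards.

```python
# Toy model only; this is not how DVC is implemented.
# Maps each stage to the stages it depends on (A --> B --> C).
DEPS = {"A": set(), "B": {"A"}, "C": {"B"}}


def reachable(targets, edges):
    """Return the targets plus every stage reachable from them via edges."""
    seen = set()
    stack = list(targets)
    while stack:
        stage = stack.pop()
        if stage not in seen:
            seen.add(stage)
            stack.extend(edges[stage])
    return seen


def default_repro(targets, deps):
    """Default mode: the targets and everything upstream of them."""
    return reachable(targets, deps)


def downstream_repro(targets, deps):
    """--downstream mode: the targets and everything that depends on them."""
    dependents = {stage: set() for stage in deps}
    for stage, stage_deps in deps.items():
        for dep in stage_deps:
            dependents[dep].add(stage)
    return reachable(targets, dependents)


# pipelines/train/dvc.yaml defines A and B, so the file target means "A B".
print(sorted(default_repro(["A", "B"], DEPS)))     # ['A', 'B']
print(sorted(downstream_repro(["A", "B"], DEPS)))  # ['A', 'B', 'C']
# pipelines/test/dvc.yaml defines only C, yet its upstream includes A and B.
print(sorted(default_repro(["C"], DEPS)))          # ['A', 'B', 'C']
```

The last line mirrors the dvc dag output for pipelines/test/dvc.yaml in the report, where A and B appear even though they are defined in another file.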

skshetry commented 1 year ago

@nono1515, If you want to run only the stages in the file, you can do dvc repro -s dvc.yaml.

You can reproduce the whole pipeline of all the stages in dvc.yaml using dvc repro --pipeline dvc.yaml, which will run all dependencies of the stages in the dvc.yaml file and everything that follows from them. In the above case, it'll run A, B, and C.

You can use --all-pipelines to run every pipeline in the repository.
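The three options above can be compared in a small toy model. This is a sketch under stated assumptions, not DVC's code: the function names are invented, and stage D is a hypothetical unrelated stage added only to show that --pipeline stays within the connected graph while --all-pipelines does not.

```python
# Toy model only. A --> B --> C as in this issue; D is a hypothetical,
# unconnected stage that does not appear anywhere in the report.
DEPS = {"A": set(), "B": {"A"}, "C": {"B"}, "D": set()}


def single_item(targets, deps):
    """-s / --single-item: only the targets, no search through dependencies."""
    return set(targets)


def whole_pipeline(targets, deps):
    """--pipeline: every stage connected to the targets, in either direction."""
    neighbours = {stage: set(stage_deps) for stage, stage_deps in deps.items()}
    for stage, stage_deps in deps.items():
        for dep in stage_deps:
            neighbours[dep].add(stage)  # make each edge walkable both ways
    seen = set()
    stack = list(targets)
    while stack:
        stage = stack.pop()
        if stage not in seen:
            seen.add(stage)
            stack.extend(neighbours[stage])
    return seen


def all_pipelines(deps):
    """--all-pipelines: every stage in the workspace."""
    return set(deps)


train_targets = ["A", "B"]  # the stages defined in pipelines/train/dvc.yaml
print(sorted(single_item(train_targets, DEPS)))     # ['A', 'B']
print(sorted(whole_pipeline(train_targets, DEPS)))  # ['A', 'B', 'C']
print(sorted(all_pipelines(DEPS)))                  # ['A', 'B', 'C', 'D']
```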

dberenbaum commented 1 year ago

@nono1515, If you want to run only the stages in the file, you can do dvc repro -s dvc.yaml.

@skshetry How is it different from dvc repro dvc.yaml? In the example above, that also only runs the stages in the file. I think I'm confused about the default behavior of targets. Are they reproducing the pipeline associated with that target, or only the specified stages, or something else (only upstream/downstream stages)?

dberenbaum commented 1 year ago

@skshetry How is it different from dvc repro dvc.yaml? In the example above, that also only runs the stages in the file. I think I'm confused about the default behavior of targets. Are they reproducing the pipeline associated with that target, or only the specified stages, or something else (only upstream/downstream stages)?

I guess it's running all the upstream stages by default? If so, I think we can clarify better in the docs.

skshetry commented 1 year ago

@dberenbaum, the stages that are in dvc.yaml may have some dependencies defined elsewhere, in which case it will try to run them before running the stages in dvc.yaml.

I guess it's running all the upstream stages by default? If so, I think we can clarify better in the docs.

I think the confusion is with dvc.yaml as a target, not the upstream, if I understand you correctly. Docs seem to indicate that the stages are limited to just the file if the target is a file, which is not the case.

dvc repro linear/dvc.yaml: A dvc.yaml file

We also don't really document the multi dvc.yaml file thingy, and in the docs, "pipeline" and "dvc.yaml" are used synonymously.

dberenbaum commented 1 year ago

Thanks @skshetry and sorry for my confusion here -- I see it's at most a docs issue now and doesn't seem like any behavior needs to change.

I think the confusion is with dvc.yaml as a target, not the upstream, if I understand you correctly. Docs seem to indicate that the stages are limited to just the file if the target is a file, which is not the case.

The docs do say:

Keep in mind that one dvc.yaml file does not necessarily equal one pipeline (although that is typical). So DVC reads all the dvc.yaml files in the workspace to rebuild pipeline(s).

Maybe it could be more prominent. Hard for me to say how others will interpret it, but I got confused here because once I was in a "downstream" mindset, I forgot that the default behavior is "upstream," and I don't see it explained anywhere that specifying a target means reproducing everything upstream (that is, everything up to and including the targets). In particular, the combination of different dvc.yaml files and upstream/downstream behavior is confusing, since the default may run stages outside of dvc.yaml, but only if they are upstream of targets specified in dvc.yaml.

dberenbaum commented 1 year ago

Getting back to the original question, is there a way to run only downstream stages within a single dvc.yaml file?

skshetry commented 1 year ago

Isn't that the same as dvc repro -s dvc.yaml?

dberenbaum commented 1 year ago

Isn't that the same as dvc repro -s dvc.yaml?

AFAIU the request from discord is different. They only want to run stages in dvc.yaml that are downstream from a specific stage, but dvc.yaml may also contain upstream or unrelated stages.
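The selection being asked for can be written down as a set operation: take everything downstream of the chosen stage, then keep only the stages defined in the same file. The sketch below is a toy model of that idea and not an existing DVC feature. The function name, the STAGE_FILE mapping, and stage P (a hypothetical upstream stage in the same file) are all assumptions. It needs Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter

# Toy model only. P is a hypothetical stage upstream of A in the same file.
DEPS = {"P": set(), "A": {"P"}, "B": {"A"}, "C": {"B"}}
STAGE_FILE = {
    "P": "pipelines/train/dvc.yaml",
    "A": "pipelines/train/dvc.yaml",
    "B": "pipelines/train/dvc.yaml",
    "C": "pipelines/test/dvc.yaml",
}


def downstream_in_file(start, deps, stage_file):
    """Stages downstream of start (inclusive) that live in start's file,
    listed in dependency order."""
    dependents = {stage: set() for stage in deps}
    for stage, stage_deps in deps.items():
        for dep in stage_deps:
            dependents[dep].add(stage)

    # Walk forward from the starting stage.
    selected = set()
    stack = [start]
    while stack:
        stage = stack.pop()
        if stage not in selected:
            selected.add(stage)
            stack.extend(dependents[stage])

    # Drop stages defined in other files, then order by dependencies.
    same_file = {s for s in selected if stage_file[s] == stage_file[start]}
    order = TopologicalSorter(deps).static_order()
    return [stage for stage in order if stage in same_file]


# Starting from A: P is excluded (upstream), C is excluded (other file).
print(downstream_in_file("A", DEPS, STAGE_FILE))  # ['A', 'B']
```

How such a list would best be handed to dvc repro is not settled in this thread.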

skshetry commented 1 year ago

I find the use case a bit odd. While I do see some use cases for --downstream (and maybe --continue-until here?), dvc is not a general-purpose task runner. dvc repro's primary task is to reproduce the given target.

Limiting to a file can give surprising results, as dvc repro runs your DAG, which may extend beyond the file where the "source" stages are defined.