iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0
13.9k stars 1.19k forks

dvc.yaml: future of foreach stages #5440

Closed: jorgeorpinel closed this issue 11 months ago

jorgeorpinel commented 3 years ago

Sorry for the long text in advance. As a disclaimer, I plan to tag others here and probably not be too involved in this discussion for now (if it happens), so please feel free to manage the issue as you best see fit.


First a Q: What is the main use case or reason for foreach stages? I haven't been able to find an independent issue where this is explained. It seems to have spawned naturally out of the parameterization feature (OP #3633). The usage is explained in its PR (#4734) but again I can't find the official motivation (what problem gets solved). Edit: I found some motivation in https://discuss.dvc.org/t/using-dvc-to-keep-track-of-multiple-model-variants/471/2 towards the idea of "generalizing stages", OK let's go with that for now.

The reason I ask is that, while it's a well-engineered feature, the current syntax may encourage "misuse" of DVC. Specifically, it seems rather difficult to connect the stages defined inside foreach/do to one another (let alone across foreach clauses). For example:

```yaml
stages:
  mystages:
    foreach: ${mylist}
    do:
      cmd: ./${item.exec} in ${item.in} out ${item.out}
```

First, the command in each stage should probably change (thus ${item.exec} above). Second, these stages only form a pipeline if ${item.out}_i = ${item.in}_{i+1} (i = 1 ... len(mylist) - 1), i.e. each item's output is the next item's input. This kind of pattern implies a very careful construction of params.yaml (or embedded vars:). While quite doable, I doubt all this will be obvious to users.
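To illustrate, such a chain could look like the following hypothetical params.yaml (all script and file names here are made up for illustration), where each item's `out` deliberately matches the next item's `in`:

```yaml
# Hypothetical params.yaml: items must be hand-ordered so that each
# item's `out` is the next item's `in`, otherwise the generated
# stages are disconnected rather than a pipeline.
mylist:
  - exec: clean.py
    in: raw.csv
    out: clean.csv
  - exec: featurize.py
    in: clean.csv       # must equal the previous item's `out`
    out: features.csv
  - exec: train.py
    in: features.csv    # must equal the previous item's `out`
    out: model.pkl
```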

We can document it (maybe transfer this to dvc.org if that is the conclusion), but most users tend to jump into usage first and ask questions later. And it seems to me that by far the most intuitive way to use this now is to create a bunch of completely disconnected stages — which earlier led me to discuss #5181 — that are really one stage that partitions data inputs/outputs for batching or parallel processing (which we don't support yet).

So we may end up running into support cases like https://discuss.dvc.org/t/versioning-predictions/656 very often. BTW that case does have a nice usage for this feature: facilitate per-file data provenance.

Per some earlier discussions (example) I know we don't want foreach stages to be considered "groups", so I assume this is a misuse.

Questions:

Thanks!

See also #5172

dmpetrov commented 3 years ago

@jorgeorpinel really good questions. It seems like we should be clear about the scenarios for foreach: what it aims to solve and what it does not.

It feels like we are mixing up different scenarios, while foreach was designed with only one of them in mind.

  1. Wide stage. Generate multiple similar artifacts/models for a single next stage.
     a. Parallel pipelines. This is a special case of (1).
  2. Long stage. Connect inputs and outputs so that ${item.out}_i = ${item.in}_{i+1} (i = 1 ... len(mylist) - 1).

The initial motivation for foreach was the 1st scenario. For example, building 3 separate models for different markets (us, gb, ch) and later ensembling them together into a single model and metric.
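That wide-stage pattern might be sketched roughly as follows (a hypothetical dvc.yaml; the script names, paths, and the ensemble stage are invented for illustration):

```yaml
# Hypothetical wide-stage dvc.yaml: three independent training stages
# feed a single downstream ensemble stage.
stages:
  train:
    foreach: [us, gb, ch]
    do:
      cmd: python train.py --market ${item}
      deps:
        - data/${item}.csv
      outs:
        - models/${item}.pkl
  ensemble:
    cmd: python ensemble.py models/ models/ensemble.pkl
    deps:
      - models/us.pkl
      - models/gb.pkl
      - models/ch.pkl
    outs:
      - models/ensemble.pkl
```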

@jorgeorpinel you are mostly referencing the 2nd scenario. I agree that it might not be covered well by the current design, and I don't know if we should focus on this scenario. Based on your experience, how often do users implement (1) and (2)? Which one is the most common case?

jorgeorpinel commented 3 years ago

OK, so what I call "partitioning" is not a misuse after all, and foreach stages can oftentimes be considered groups, then.

Based on your experience, how often do users implement (1) and (2)?

I personally can only remember wide stage scenarios from past support cases (another example). I have a feeling that (1) will be 90% of the usage but it's too early to tell. We'll find out I suppose 🙂

Also, I forgot to mention the only small problem with the wide stage case in the current implementation: if the width keeps changing, dvc.lock will grow linearly until users notice it and figure out that they have to delete it regularly. This is because DVC doesn't delete old entries from dvc.lock when you change their names in dvc.yaml (to prevent merge conflicts in Git). I don't have a suggestion rn on how to improve this.

dberenbaum commented 3 years ago

Agreed with both of you that the wide stage scenario is what foreach solves and is its likely intended use (and that parallel execution is an additional but distinct consideration for these scenarios).

The issue presented in https://discuss.dvc.org/t/versioning-predictions/656 is unique in that the inputs to foreach are completely different with each run. A more typical use case might be something like building an NLP model for 100 different languages, each of which is specified in params.yaml. More may be added over time, but it would be rare to completely change the set of languages.

In the scenario presented by mauro, the pipeline is being used not for model development/training, but for inference/prediction/scoring of new batches of data. This is a common use case (why else develop ML models?), but DVC does not have an obvious pattern for it right now.

I think the suggestions to either create separate directories or delete dvc.lock are reasonable. Another suggestion would be to create a branch (maybe named by date or some identifier of that batch of data) for each run to avoid conflicts in dvc.lock and to prevent any kind of "sequencing", since I'm assuming there is no connection between one run and the next.

Another idea we could explore would be to have a command (or more likely reuse an existing command like status or list) that outputs the dvc.lock info specifically for the stages in the current workspace's pipeline/dvc.yaml.
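The filtering behind that last idea could be sketched in a few lines of pure Python. This is not a real DVC API; all names are hypothetical, and parsing/serializing the YAML files is left out. It assumes dvc.lock has been parsed into a dict and that the caller knows the expanded stage names currently generated by dvc.yaml (foreach stages expand to names like "predict@2021-01-01"):

```python
# Hypothetical sketch: keep only the dvc.lock entries for stages that
# the current dvc.yaml still generates, dropping stale foreach entries.
def prune_lock(lock: dict, current_stages: set) -> dict:
    """Return a copy of parsed dvc.lock data with only live stages."""
    kept = {name: data
            for name, data in lock.get("stages", {}).items()
            if name in current_stages}
    return {**lock, "stages": kept}


# Example: an old batch lingers in the lock data after a rename.
lock = {"schema": "2.0",
        "stages": {"predict@2021-01-01": {}, "predict@2021-01-02": {}}}
pruned = prune_lock(lock, {"predict@2021-01-02"})
# pruned["stages"] now holds only the stage from the current batch
```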

I'd be interested to get feedback from mauro and others who want to use dvc for prediction using a pre-trained model (or users who are both regularly re-training and then making predictions on new data).

dmpetrov commented 3 years ago

I have a feeling that (1) will be 90% of the usage but it's too early to tell

Yep. It matches the feeling that we had when we were designing this feature.

I won't be surprised if we get more signals for the 2nd scenario - that would be the time to improve foreach. But the current design looks good.

I think the suggestions to either create separate directories or delete dvc.lock are reasonable.

Agree. It would be great to hear @skshetry's opinion on this.

I'd be interested to get feedback from mauro and others who want to use dvc for prediction using a pre-trained model

💯 it would be great to find more scenarios and feedback. We can use that for the next version of the DVC pipelines.

jorgeorpinel commented 3 years ago

discuss.dvc.org/t/versioning-predictions/656 is unique in that the inputs to foreach are completely different with each run

And also in that per-file provenance is a requirement which is why they don't track entire directories as deps/outs, preferring a per-file foreach.

the pipeline is being used not for model development/training, but for inference/prediction/scoring of new batches of data

Yeah so maybe that's the unintended use? DVC pipeline ≠ general task automation

dberenbaum commented 3 years ago

Yeah so maybe that's the unintended use? DVC pipeline ≠ general task automation

Since it's a core part of ML pipelines, I'd prefer to say it's a use that's not well developed yet :)

jorgeorpinel commented 3 years ago

Hi guys. So back on this, it seems people are finding even broader use cases for the foreach feature, as I ~~feared~~ suspected (the name makes it seem like a very general control structure if you are familiar with programming). E.g. in https://discord.com/channels/485586884165107732/485596304961962003/823593858583232522 they are generalizing a Python call with cmd: python ${item.script} ${item.input} ... (but hitting a limitation because the stages all need to have the same no. of dependencies).

My Q is, is that a 3rd case we are trying to solve, or somewhat of a misuse? Thanks
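For reference, the Discord usage might look roughly like this (a hypothetical reconstruction; the variable and field names are made up). Since every generated stage must share the same do: schema, an item needing, say, two inputs has nowhere to declare the extra dependency:

```yaml
# Hypothetical dvc.yaml along the lines of the Discord example.
stages:
  run:
    foreach: ${jobs}            # e.g. a list of {script, input} dicts
    do:
      cmd: python ${item.script} ${item.input}
      deps:
        - ${item.script}
        - ${item.input}         # exactly one input per item; items with
                                # more dependencies don't fit this schema
```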

jorgeorpinel commented 3 years ago

p.s.

reconsider the fallback run queueing order (#5181)?

the name makes it seem like a very general control structure

See also #5644. I still think users will assume it behaves like a typical foreach loop (preserving the order of items passed) even when dvc.yaml is clearly declarative, not imperative.

dberenbaum commented 3 years ago

they are generalizing a Python call with cmd: python ${item.script} ${item.input} ... (but hitting a limitation because the stages all need to have the same no. of dependencies).

I think this is beyond the scope of what foreach supports, and it's a good example of where a Python API for creating the DAG might be better. @skshetry has started that discussion in #5646, so I'll just quote him:

We have already introduced foreach (aka for loops) and parametrization (aka fstring). There's a discussion about introducing map-reduce and nested-foreach within dvc.yaml. The question could be asked another way around: is it worth adding them in dvc.yaml if we can provide a good python API?

Python APIs are usually very granular, so they can capture many more scenarios (known and unknown ones) and semantics than the supposedly high-level dvc.yaml file. With Python, we have a complete programming language on our side; we can easily provide generalized and specialized APIs to satisfy user needs. In dvc.yaml, this is very much limited (or we are going to add too many keywords and schemas to support them, which makes it hard to read/edit/write).
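To make the contrast concrete, here is a tiny self-contained toy sketch (this is not DVC's API; every name is invented) of why a programmatic builder sidesteps the fixed-schema problem: each stage's dependency list is computed in ordinary Python, so stages can differ in arity:

```python
# Toy pipeline builder, purely illustrative. Each stage is a plain
# dict, so different stages can have different numbers of dependencies,
# which a fixed foreach/do schema cannot express.
class Pipeline:
    def __init__(self):
        self.stages = {}

    def stage(self, name, cmd, deps=(), outs=()):
        self.stages[name] = {"cmd": cmd,
                             "deps": list(deps),
                             "outs": list(outs)}
        return self


# Jobs with uneven dependency counts, wired up with a normal loop.
jobs = [("a.py", ["x.csv"]), ("b.py", ["x.csv", "y.csv"])]
p = Pipeline()
for script, inputs in jobs:
    p.stage(script, f"python {script} {' '.join(inputs)}",
            deps=[script, *inputs], outs=[script + ".out"])
```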

efiop commented 11 months ago

Closing as stale