Closed by @jorgeorpinel 11 months ago
@jorgeorpinel really good questions. It seems like we should be clear about the scenarios for `foreach`: what it aims to solve and what it does not. It feels like we are mixing up different scenarios, while `foreach` was designed with only one of them in mind.
The initial motivation for `foreach` was the 1st scenario. An example: building 3 separate models for different markets (`us`, `gb`, `ch`) and later ensembling them together into a single model and metric.
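A minimal sketch of that 1st (wide-stage) scenario; the script and file names here are hypothetical, not from the original discussion:

```yaml
# dvc.yaml: one independent training stage per market, plus an ensemble stage
stages:
  train:
    foreach: [us, gb, ch]          # list of markets to loop over
    do:
      cmd: python train.py --market ${item}
      deps:
        - data/${item}.csv
      outs:
        - models/${item}.pkl
  ensemble:
    cmd: python ensemble.py models/us.pkl models/gb.pkl models/ch.pkl
    deps:
      - models/us.pkl
      - models/gb.pkl
      - models/ch.pkl
    outs:
      - models/ensemble.pkl
    metrics:
      - metrics.json
```

The generated `train@us`, `train@gb`, and `train@ch` stages are independent of each other; only `ensemble` connects them.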
@jorgeorpinel you are referencing mostly the 2nd scenario. I agree that it might not be covered well in the current design, and I don't know if we should focus on this scenario. Based on your experience, how often do users implement (1) and (2)? Which one is the most common case?
OK, so what I call "partitioning" is not a misuse after all, and `foreach` stages can oftentimes be considered groups then.
> Based on your experience, how often do users implement (1) and (2)?
I personally can only remember wide stage scenarios from past support cases (another example). I have a feeling that (1) will be 90% of the usage but it's too early to tell. We'll find out I suppose 🙂
Also, I forgot to mention one small problem with the wide-stage case in the current implementation: if the width keeps changing, `dvc.lock` will grow linearly until users notice it and figure out that they have to delete it regularly. This is because DVC doesn't delete old states from `dvc.lock` when you change their names in `dvc.yaml` (to prevent merge conflicts in Git). I don't have a suggestion right now on how to improve this.
Agreed with both of you that the wide-stage scenario is what `foreach` solves and is its likely intended use (and that parallel execution is an additional but distinct consideration for these scenarios).
The issue presented in https://discuss.dvc.org/t/versioning-predictions/656 is unique in that the inputs to `foreach` are completely different with each run. A more typical use case might be something like building an NLP model for 100 different languages, each of which is specified in `params.yaml`. More may be added over time, but it would be rare to completely change the set of languages.
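For instance, the language list could live in `params.yaml` and drive the loop (file names are illustrative); adding a new language only appends a stage, it doesn't replace the existing ones:

```yaml
# params.yaml
languages:
  - en
  - de
  - fr
```

```yaml
# dvc.yaml
stages:
  train_nlp:
    foreach: ${languages}        # iterate over the list from params.yaml
    do:
      cmd: python train.py --lang ${item}
      outs:
        - models/${item}
```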
In the scenario presented by mauro, the pipeline is being used not for model development/training, but for inference/prediction/scoring of new batches of data. This is a common use case (why else develop ML models?), but DVC does not have an obvious pattern for it right now.
I think the suggestions to either create separate directories or delete `dvc.lock` are reasonable. Another suggestion would be to create a branch (maybe named by date or by some identifier of that batch of data) for each run, to avoid conflicts in `dvc.lock` and to prevent any kind of "sequencing", since I'm assuming there is no connection between one run and the next. Another idea we could explore would be to have a command (or, more likely, reuse an existing command like `status` or `list`) that outputs the `dvc.lock` info specifically for the stages in the current workspace's pipeline/`dvc.yaml`.
I'd be interested to get feedback from mauro and others who want to use dvc for prediction using a pre-trained model (or users who are both regularly re-training and then making predictions on new data).
> I have a feeling that (1) will be 90% of the usage but it's too early to tell
Yep. It matches the feeling we had when we were designing this feature. I won't be surprised if we get more signals for the 2nd scenario; that would be the time to improve `foreach`. But the current design looks good.
> I think the suggestions to either create separate directories or delete `dvc.lock` are reasonable.
Agree. It would be great to hear @skshetry's opinion on this.
> I'd be interested to get feedback from mauro and others who want to use dvc for prediction using a pre-trained model
💯 it would be great to find more scenarios and feedback. We can use that for the next version of the DVC pipelines.
> discuss.dvc.org/t/versioning-predictions/656 is unique in that the inputs to `foreach` are completely different with each run
And also in that per-file provenance is a requirement, which is why they don't track entire directories as deps/outs, preferring a per-file `foreach`.
> the pipeline is being used not for model development/training, but for inference/prediction/scoring of new batches of data
Yeah so maybe that's the unintended use? DVC pipeline ≠ general task automation
> Yeah so maybe that's the unintended use? DVC pipeline ≠ general task automation
Since it's a core part of ML pipelines, I'd prefer to say it's a use that's not well developed yet :)
Hi guys. So, back on this: it seems people are finding even broader use cases for the `foreach` feature, as I suspected (the name makes it seem like a very general control structure if you are familiar with programming). E.g. in https://discord.com/channels/485586884165107732/485596304961962003/823593858583232522 they are generalizing a Python call with `cmd: python ${item.script} ${item.input} ...` (but hitting a limitation because the stages all need to have the same number of dependencies).
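Roughly, that Discord pattern looks like the sketch below (script and data paths are hypothetical). Each item swaps out both the script and its input, which only works as long as every generated stage has the same shape:

```yaml
# dvc.yaml: using foreach as a generic "run this script on this file" loop
stages:
  run:
    foreach:
      - script: prepare.py
        input: data/raw.csv
      - script: train.py
        input: data/prepared.csv
    do:
      cmd: python ${item.script} ${item.input}
      deps:
        - ${item.script}
        - ${item.input}
```

Note that the `do:` block fixes the number of deps/outs, so items with differing numbers of inputs can't be expressed this way.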
My Q is, is that a 3rd case we are trying to solve, or somewhat of a misuse? Thanks
P.S. Should we reconsider the fallback run queueing order (#5181)?

> the name makes it seem like a very general control structure
See also #5644. I still think users will assume it behaves like a typical `foreach` loop (preserving the order of the items passed), even though `dvc.yaml` is clearly declarative, not imperative.
> they are generalizing a Python call with `cmd: python ${item.script} ${item.input} ...` (but hitting a limitation because the stages all need to have the same number of dependencies).
I think this is beyond the scope of what `foreach` supports, and it's a good example of where a Python API for creating the DAG might be better. @skshetry has started that discussion in #5646, so I'll just quote him:
> We have already introduced foreach (aka for loops) and parametrization (aka fstring). There's a discussion about introducing map-reduce and nested-foreach within dvc.yaml. The question could be asked the other way around: is it worth adding them to dvc.yaml if we can provide a good Python API?
>
> Python APIs are usually very granular, so they can capture many more scenarios (known and unknown ones) and semantics than the supposedly high-level dvc.yaml file. With Python, we have a complete programming language on our side; we can easily provide generalized and specialized APIs to satisfy user needs. In dvc.yaml, this is very much limited (or we are going to add too many keywords and schemas to support them, which makes it hard to read/edit/write).
Closing as stale
Sorry in advance for the long text, and as a disclaimer: I plan to tag others here and will probably not be too involved in this discussion for now (if it happens). So please feel free to manage the issue as you best see fit.
First, a question: What is the main use case or reason for `foreach` stages? I haven't been able to find an independent issue where this is explained. It seems to have spawned naturally out of the parameterization feature (OP: #3633). The usage is explained in its PR (#4734), but again I can't find the official motivation (what problem gets solved). Edit: I found some motivation in https://discuss.dvc.org/t/using-dvc-to-keep-track-of-multiple-model-variants/471/2 towards the idea of "generalizing stages"; OK, let's go with that for now.
The reason I ask is that, while it's a greatly engineered feature, the current syntax may encourage a "misuse" of DVC. Specifically, it seems rather difficult to connect the stages defined inside `foreach/do` to one another (let alone among `foreach` clauses). For example: first, the command in each stage should probably change (thus `${item.exec}` above). Second, only if `${item.out}` of item i equals `${item.in}` of item i+1 (for i = 1 ... len(mylist)) would these stages form a pipeline. This kind of pattern implies a very careful construction of `params.yaml` (or embedded `vars:`). While quite doable, I doubt all this will be obvious to users. We can document it (maybe transfer this to dvc.org if that is the conclusion), but most users tend to jump into usage first and ask questions later. And it seems to me that by far the most intuitive way to use this now is to create a bunch of completely disconnected stages (which earlier led me to discuss #5181) that are really one stage partitioning data inputs/outputs for batching or parallel processing (which we don't support yet).
Per some earlier discussions (example) I know we don't want `foreach` stages to be considered "groups", so I assume this is a misuse.
Questions: … `${list.val}` instead of `foreach/do`, as suggested originally). … `vars:`? Thanks!