iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

Support for parameterized sequential pipelines #10627

Open henrypickler opened 4 days ago

henrypickler commented 4 days ago

I want to parameterize how many times to repeat a stage which depends on previous stages. For example, consider the list [0, 0.2, 0.5, 0.75] and a held-out dataset. I want to have a pipeline that does the following:

Ideally, I want to be able to change the list to a different size, for example [0, 0.2, 0.4, 0.5, 0.75, 0.85, 0.95], which would define re-train@0 through re-train@5. Better still, the pipeline could then reuse the cached model_0 and model_20 (model_50 and model_75 are different now because they depend on model_40).
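To make the caching expectation concrete, here is a minimal sketch of which models could be reused when the list changes. It assumes each model depends only on its predecessor in the list and that train.py and the data are unchanged. The helper names `model_name` and `reusable_models` are hypothetical and not part of DVC.

```python
def model_name(fraction):
    """Name a model after its fraction as a percentage, e.g. 0.2 -> 'model_20'."""
    return f"model_{round(fraction * 100)}"


def reusable_models(old_fractions, new_fractions):
    """Return the models whose whole upstream chain is identical in both lists."""
    reusable = []
    for old, new in zip(old_fractions, new_fractions):
        if old != new:
            # Everything downstream of the first difference has a new ancestor,
            # so it must be retrained.
            break
        reusable.append(model_name(new))
    return reusable


old = [0, 0.2, 0.5, 0.75]
new = [0, 0.2, 0.4, 0.5, 0.75, 0.85, 0.95]
print(reusable_models(old, new))  # ['model_0', 'model_20']
```

Only the shared prefix of the two lists survives: model_50 and model_75 exist in both, but their ancestor model_40 is new, so they have to be rebuilt.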

I tried defining the stage with foreach. However, an item in a foreach cannot reference the previous item, so I cannot declare the previous stage's output as a dependency. For example, if something like this were possible:

re-train:
    foreach: [0,0.2,0.5,0.75]
    do:
        cmd: python train.py --reference-model=model_${prev_item} --output-model=model_${item}
        deps: [model_${prev_item}]
        outs: [model_${item}]

then it would be fairly easy to chain the stages. However, AFAIK this is not possible, so my workaround is to define a list of objects in vars, such as:

re-trains:
  - {curr: 0.2, prev: 0}
  - {curr: 0.5, prev: 0.2}
  - {curr: 0.75, prev: 0.5}

I then reference ${item.curr} and ${item.prev} in the stage. However, this is error-prone (setting prev incorrectly gives weird results without any warning) and a bit of a hassle to maintain.
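One way to make this workaround less error-prone is to generate the curr/prev pairs from the plain list rather than writing them by hand. This is a minimal sketch: `chain_pairs` is a hypothetical helper, and the result is printed as JSON only for illustration. How it gets written into the file the pipeline reads is left out.

```python
import json


def chain_pairs(fractions):
    """Pair every value with its predecessor: [a, b, c] -> (prev=a, curr=b), (prev=b, curr=c)."""
    return [
        {"curr": curr, "prev": prev}
        for prev, curr in zip(fractions, fractions[1:])
    ]


pairs = chain_pairs([0, 0.2, 0.5, 0.75])
print(json.dumps({"re-trains": pairs}, indent=2))
```

Because every prev comes from the element before its curr, the list cannot get out of sync when a value is inserted or removed.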

I use dbt very frequently, so I think Jinja2 templating could be a good tool for cases like this. For example, my situation would be solved by something like this:

{% set stages = [0.2, 0.5, 0.75] %}

train:
  cmd: python train.py --output-model=model_0
  outs: [model_0]

{% for stage in stages %}
re-train@{{ loop.index0 }}:
    {% set input_model = 'model_0' if loop.first else 'model_' ~ stages[loop.index0 - 1] | replace(".", "_") %}
    {% set output_model = 'model_' ~ stage | replace(".", "_") %}
    cmd: python train.py --reference-model={{ input_model }} --output-model={{ output_model }}
    deps:
      - {{ input_model }}
    outs:
      - {{ output_model }}
{% endfor %}

Putting it in a template renderer gives:

Rendered output

```
train:
  cmd: python train.py --output-model=model_0
  outs: [model_0]

re-train@0:
    cmd: python train.py --reference-model=model_0 --output-model=model_0_2
    deps:
      - model_0
    outs:
      - model_0_2

re-train@1:
    cmd: python train.py --reference-model=model_0_2 --output-model=model_0_5
    deps:
      - model_0_2
    outs:
      - model_0_5

re-train@2:
    cmd: python train.py --reference-model=model_0_5 --output-model=model_0_75
    deps:
      - model_0_5
    outs:
      - model_0_75
```

I searched for Jinja2 in the repo and it seems it has been considered previously (and deemed too weird/ugly which, honestly, I agree with, especially for beginners). However, drawing inspiration from it, another approach would be to allow arithmetic in DVC string interpolation and to expose more loop values, for example an idx variable, which would enable something like:

vars:
    - retrains: [0.2, 0.5, 0.75]

train:
    cmd: python train.py --output-model=model_0
    outs: [model_0]

re-train:
    foreach: ${retrains}
    do:
        # ${idx} and ${idx+1} are the proposed additions, not existing DVC syntax
        cmd: python train.py --reference-model=model_${idx} --output-model=model_${idx+1}
        deps: [model_${idx}]
        outs: [model_${idx+1}]

This is much cleaner.
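To illustrate what the proposed expansion would produce, here is a toy sketch in plain Python. It is not how DVC resolves templates: `expand` and `expand_foreach` are hypothetical helpers, only `${idx}` and `${idx+N}` are handled, and stages are named by index to match the re-train@0 naming used above.

```python
import re

# Matches ${idx} and ${idx+N}, where N is a non-negative integer.
_IDX_PATTERN = re.compile(r"\$\{idx(?:\s*\+\s*(\d+))?\}")


def expand(template, idx):
    """Replace ${idx} and ${idx+N} in a string with concrete integers."""
    def substitute(match):
        offset = int(match.group(1) or 0)
        return str(idx + offset)
    return _IDX_PATTERN.sub(substitute, template)


def expand_foreach(name, items, do):
    """Build one stage per item, expanding the idx placeholders in every field."""
    stages = {}
    for idx, _item in enumerate(items):
        stages[f"{name}@{idx}"] = {
            key: [expand(v, idx) for v in value] if isinstance(value, list)
            else expand(value, idx)
            for key, value in do.items()
        }
    return stages


do = {
    "cmd": "python train.py --reference-model=model_${idx} --output-model=model_${idx+1}",
    "deps": ["model_${idx}"],
    "outs": ["model_${idx+1}"],
}
stages = expand_foreach("re-train", [0.2, 0.5, 0.75], do)
for stage_name, stage in stages.items():
    print(stage_name, stage["deps"], "->", stage["outs"])
```

Each generated stage depends on the output of the one before it, so re-train@0 consumes model_0 from the train stage and the rest chain automatically.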