I want to parameterize how many times to repeat a stage which depends on previous stages. For example, consider the list `[0, 0.2, 0.5, 0.75]` and a held-out dataset. I want to have a pipeline that does the following:

- Start: train an initial `model_0`.
- `re-train@0`: process 20% of the held-out dataset using `model_0` and re-train a new model, `model_20`, including the newly processed samples.
- `re-train@1`: process the next 30% of the held-out data with `model_20` and re-train `model_50`.
- `re-train@2`: process the next 25% with `model_50` and re-train `model_75`.
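
The slice each stage processes is the gap between consecutive list entries (20%, then 30%, then 25%). A minimal sketch of that arithmetic, assuming the held-out set is an ordered collection of `n_samples` items; `slice_bounds` and `n_samples` are hypothetical names, not part of DVC:

```python
def slice_bounds(fractions, n_samples):
    """Return (start, stop) sample indices for each re-train stage."""
    # Turn the cumulative fractions into integer cut points.
    cuts = [round(f * n_samples) for f in fractions]
    # Stage i processes the samples between two consecutive cut points.
    return list(zip(cuts[:-1], cuts[1:]))


fractions = [0, 0.2, 0.5, 0.75]
bounds = slice_bounds(fractions, n_samples=1000)
print(bounds)  # [(0, 200), (200, 500), (500, 750)]
```

The last 25% of the held-out data is never processed with this list, which matches the example above.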
Ideally I want to be able to change the list to a different size, for example `[0, 0.2, 0.4, 0.5, 0.75, 0.85, 0.95]`, which would define `re-train@0` through `re-train@5`. More than that, it could then re-use the cached `model_0` and `model_20` (`model_50` and `model_75` are different now because they depend on `model_40`).
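
The reuse described above follows from the chain: a model stays valid only while every list entry up to and including its own is unchanged. A minimal sketch of that reasoning (not of DVC's actual cache logic); `model_name` and `reusable_models` are hypothetical helpers, and the `model_<percent>` naming follows the example:

```python
def model_name(fraction):
    # 0.2 -> "model_20", following the naming used in the example.
    return f"model_{round(fraction * 100)}"


def reusable_models(old, new):
    """Models whose whole chain of inputs is identical in both lists."""
    shared = []
    for a, b in zip(old, new):
        if a != b:
            break  # every later model depends on a changed predecessor
        shared.append(model_name(a))
    return shared


old = [0, 0.2, 0.5, 0.75]
new = [0, 0.2, 0.4, 0.5, 0.75, 0.85, 0.95]
print(reusable_models(old, new))  # ['model_0', 'model_20']
```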
I tried doing this using a `foreach` to define my stage. However, each stage needs to reference the previous stage's output as a dependency, and `foreach` gives no way to do that. If it did, it would be fairly easy to chain the stages. AFAIK this is not possible, so my workaround is to define an object in `vars` with `curr` and `prev` entries for each item, and then reference `${item.curr}` and `${item.prev}`. However, this is error-prone (setting `prev` wrongly gives weird results without any warning) and a bit of a hassle to deal with.

I use DBT very frequently, so I think Jinja2 templating could be a good tool for these cases. For example, my situation could be solved with a Jinja2 template that generates the chained stages.
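
As a rough stand-in for such a template, the same loop can be sketched in plain Python, so that it runs on the standard library alone rather than on Jinja2. This is only an illustration: `build_stages` and `model_name` are hypothetical helpers, the `model_0_2` style of naming (dots replaced by underscores) is an assumption, and each stage is assumed to depend on the previous model and produce the next one.

```python
def model_name(fraction):
    # 0.2 -> "model_0_2" (assumed naming: dots replaced by underscores).
    return "model_" + str(fraction).replace(".", "_")


def build_stages(fractions):
    """Build a dict of stage definitions chained through their models."""
    first = model_name(fractions[0])
    stages = {
        "train": {
            "cmd": f"python train.py --output-model={first}",
            "outs": [first],
        }
    }
    # idx - 1 is the previous item, which is what a plain foreach lacks.
    for idx in range(1, len(fractions)):
        prev = model_name(fractions[idx - 1])
        curr = model_name(fractions[idx])
        stages[f"re-train@{idx - 1}"] = {
            "cmd": f"python train.py --reference-model={prev} --output-model={curr}",
            "deps": [prev],
            "outs": [curr],
        }
    return stages


stages = build_stages([0, 0.2, 0.5, 0.75])
for name, stage in stages.items():
    print(name, "->", stage["cmd"])
```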
Putting such a template in a template renderer gives:

```yaml
train:
  cmd: python train.py --output-model=model_0
  outs: [model_0]

re-train@0:
  cmd: python train.py --reference-model=model_0 --output-model=model_0_2
  deps:
    - model_0
  outs:
    - model_0_2

re-train@1:
  cmd: python train.py --reference-model=model_0_2 --output-model=model_0_5
  deps:
    - model_0_2
  outs:
    - model_0_5

re-train@2:
  cmd: python train.py --reference-model=model_0_5 --output-model=model_0_75
  deps:
    - model_0_5
  outs:
    - model_0_75
```

I searched for jinja2 on the repo and it seems that it has been considered previously (and deemed too weird/ugly, which, honestly, I agree with, especially for beginners). However, drawing inspiration from it, another approach would be to allow arithmetic in DVC string interpolation and also provide more values for loops, for example `idx`, which would enable a much cleaner definition.
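
In the meantime, the hand-written `prev`/`curr` entries from the workaround could at least be checked before running the pipeline. A minimal sketch, assuming each entry is a dict with `prev` and `curr` keys; `check_chain` is a hypothetical helper, not something DVC provides:

```python
def check_chain(items):
    """Raise ValueError if an item's prev is not the preceding item's curr."""
    for idx in range(1, len(items)):
        expected = items[idx - 1]["curr"]
        actual = items[idx]["prev"]
        if actual != expected:
            raise ValueError(
                f"item {idx}: prev={actual!r} but previous curr={expected!r}"
            )


good = [
    {"prev": 0, "curr": 20},
    {"prev": 20, "curr": 50},
    {"prev": 50, "curr": 75},
]
check_chain(good)  # a consistent chain passes silently

bad = [{"prev": 0, "curr": 20}, {"prev": 0, "curr": 50}]
try:
    check_chain(bad)
except ValueError as err:
    print(err)  # item 1: prev=0 but previous curr=20
```

This gives the warning that the workaround lacks when `prev` is set wrongly, but it does not remove the duplication itself.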