Open EvanKomp opened 1 month ago
Could you provide a simplified dvc.yaml to clarify how your pipeline is set up?
@dberenbaum
stages:
  preprocess:
    cmd: ./prepare.sh
    outs:
      - ./data/preprocessing/
  predict:
    cmd: ./predict.sh
    params:
      - model_type  # One of A, B, C
    deps:
      - ./data/preprocessing/  # This only needs to be a dependency when `model_type` is in [A, B]
    outs:
      - ./data/predictions/
Unfortunately, I can't think of a good way to do it without creating separate stages/pipelines. If you have some idea of what you would want it to look like, feel free to suggest it here.
Affirmative. Thanks for your work. I think expanding on the YAML, like you would with a cache tag, would be best. E.g.:
stages:
  preprocess:
    cmd: ./prepare.sh
    outs:
      - ./data/preprocessing/
  predict:
    cmd: ./predict.sh
    params:
      - model.model_type  # One of A, B, C
    deps:
      - ./data/preprocessing/:
          # conditioning syntax
          conditions:  # these are executable strings with params as local namespace
            - 'model.model_type in ["A", "B"]'
    outs:
      - ./data/predictions/
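To make the "executable strings with params as local namespace" idea concrete, here is a rough Python sketch of how such a condition could be evaluated. This is not an existing DVC feature or API; `to_namespace` and `dep_is_active` are hypothetical names made up for illustration, and the params are assumed to be already loaded into a plain nested dict.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively wrap dicts so nested params can be read as dotted attributes."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    return value


def dep_is_active(conditions, params):
    """Hypothetical helper: True when every condition string is truthy for params."""
    # Top-level param keys become local names, e.g. `model`.
    namespace = {key: to_namespace(val) for key, val in params.items()}
    # eval mirrors the "executable strings" proposal. Builtins are removed here,
    # but this is still not safe for untrusted input.
    return all(eval(cond, {"__builtins__": {}}, namespace) for cond in conditions)


conditions = ['model.model_type in ["A", "B"]']
print(dep_is_active(conditions, {"model": {"model_type": "A"}}))  # True
print(dep_is_active(conditions, {"model": {"model_type": "C"}}))  # False
```

Under this sketch, the `./data/preprocessing/` dependency would only be tracked when the condition holds, so switching `model.model_type` to C would leave `preprocess` out of the picture. A real implementation would probably want a restricted expression parser rather than `eval`.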
Correct me if this already exists. I see some merges from 2018 that may be related (#646), but I can't find any examples.
Essentially I have a stage that prepares a model, of which I would like to specify multiple options as parameters. Each model has a potentially unique preprocessing step, BUT some models share an additional preprocessing step.
For example, param `model` modulates stage `predict`, which for some models requires no previous stage, but for others requires a stage `preprocess`. How can I ensure that `preprocess` is run for the models that require it, but not rerun unnecessarily, since it is expensive? If I have the `preprocess` step also conditioned on param `model`, it will rerun the step even when I switch between models where it does not need to be rerun.

Thanks for any wisdom.