iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

Stages with conditional dependency #10418

Open EvanKomp opened 1 month ago

EvanKomp commented 1 month ago

Correct me if this already exists. I see some merges from 2018 that may be related (#646), but I can't find any examples.

Essentially, I have a stage that prepares a model, and I would like to choose among several model options via a parameter. Each model has a potentially unique preprocessing step, but some models share an additional preprocessing step.

For example, the param model modulates the stage predict, which for some models requires no previous stage but for others requires a stage preprocess. How can I ensure that preprocess is run for the models that require it, without rerunning it unnecessarily, since it is expensive? If I also condition the preprocess step on the param model, it will rerun whenever the param changes, even when I switch between models for which it does not need to be rerun.

Thanks for any wisdom.

dberenbaum commented 1 month ago

Could you provide a simplified dvc.yaml to clarify how your pipeline is set up?

EvanKomp commented 1 month ago

@dberenbaum

stages:
  preprocess:
    cmd:  ./prepare.sh
    outs:
      - ./data/preprocessing/
  predict:
    cmd: ./predict.sh
    params:
      - model_type       # One of A, B, C
    deps:
      - ./data/preprocessing/       # THIS ONLY NEEDS TO BE A DEPENDENCY WHEN `model_type` IS IN [A, B]
    outs:
      - ./data/predictions/

dberenbaum commented 1 month ago

Unfortunately, I can't think of a good way to do it without creating separate stages/pipelines. If you have some idea of what you would want it to look like, feel free to suggest it here.

EvanKomp commented 1 month ago

Affirmative. Thanks for your work. I think extending the YAML, as you would with a cache tag, would be best, e.g.:

stages:
  preprocess:
    cmd:  ./prepare.sh
    outs:
      - ./data/preprocessing/
  predict:
    cmd: ./predict.sh
    params:
      - model.model_type       # One of A, B, C
    deps:
      - ./data/preprocessing/:
          # conditioning syntax
          conditions: # these are executable strings with params as local namespace
            - 'model.model_type in ["A", "B"]'
    outs:
      - ./data/predictions/
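
To make the intended semantics concrete, here is a minimal Python sketch of how such condition strings could be evaluated with the params as the local namespace. This is only an illustration of the proposal, not existing DVC behavior: `dependency_is_active` and `to_namespace` are hypothetical helpers, and the params dict is assumed to mirror a params.yaml with a nested `model.model_type` key.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively wrap dicts so nested params support dotted access."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    return value


def dependency_is_active(conditions, params):
    """Return True when every condition string holds for the given params.

    Hypothetical helper, not part of DVC. Each condition is evaluated as a
    Python expression with the top-level params as local names.
    """
    local_names = {key: to_namespace(val) for key, val in params.items()}
    # Emptying __builtins__ limits what an expression can call,
    # but this is not a real sandbox.
    return all(eval(cond, {"__builtins__": {}}, local_names) for cond in conditions)


conditions = ['model.model_type in ["A", "B"]']

for model_type in ("A", "B", "C"):
    params = {"model": {"model_type": model_type}}
    active = dependency_is_active(conditions, params)
    print(f"model_type={model_type}: depends on ./data/preprocessing/ -> {active}")
```

Under this reading, the dependency on ./data/preprocessing/ would be tracked for models A and B and ignored for model C, so switching to C would not trigger preprocess. Evaluating arbitrary strings with `eval` has obvious safety implications, so a real implementation would likely want a restricted expression parser instead.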