iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

Optional vars files: dynamic DAGs #10554

Open mike-grayhat opened 2 months ago

mike-grayhat commented 2 months ago

I have a rather unusual case: my pipeline relies on variables generated by its first stage. The problem is that on the first run the generated vars file doesn't exist yet, which in turn invalidates the whole YAML file.

vars:
  - items: {}
  - items.yaml # non-existent before the first run

stages:
  collect_items:
    ...
  process:
    foreach: ${items}
    do:
      ...

I don't see an easy way out (even Hydra works only for experiment runs, not for general dvc repro runs), so an option to skip missing variables would help a lot.

shcheklein commented 2 months ago

I think DVC needs all vars resolved in such cases before it can run the pipeline. Your vars essentially define the pipeline: DVC reads and compiles it first. So even if we allowed missing files, I think making it dynamic would be a bigger change. @skshetry could confirm that.

Does the content of items.yaml change on every run?
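
To make the compile-before-run point concrete, here is a toy model in Python. It is not DVC's implementation: `load_vars`, `compile_pipeline`, and the in-memory `files` dict are hypothetical names used only for illustration, and the merge logic is deliberately simplified. It sketches why a `foreach` template can only be expanded once every vars source has loaded, so a missing source fails before any stage exists.

```python
# Toy model only: not DVC's actual code. All names here are hypothetical.

def load_vars(sources, files):
    """Merge vars sources in order; a string names a file that must exist."""
    merged = {}
    for source in sources:
        if isinstance(source, str):
            if source not in files:
                raise FileNotFoundError(f"vars file not found: {source}")
            source = files[source]
        # Simplified merge: later sources overwrite earlier keys.
        merged.update(source)
    return merged


def compile_pipeline(sources, files):
    """Expand the 'process' foreach template into concrete stage names."""
    variables = load_vars(sources, files)  # must succeed before expansion
    stages = ["collect_items"]
    stages += [f"process@{key}" for key in variables["items"]]
    return stages


sources = [{"items": {}}, "items.yaml"]

# First run: items.yaml has not been generated yet, so compilation fails
# before even collect_items could be scheduled.
try:
    compile_pipeline(sources, files={})
    first_run_error = None
except FileNotFoundError as exc:
    first_run_error = str(exc)

# Later run: the file exists, so the foreach expands normally.
later_stages = compile_pipeline(
    sources, files={"items.yaml": {"items": {"a": 1, "b": 2}}}
)

print(first_run_error)
print(later_stages)
```

In this model the inline `items: {}` default never helps, because loading stops at the first missing file rather than falling back to what has already been merged.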

mike-grayhat commented 2 months ago

The content of items.yaml is generated from external sources, so it changes from time to time. The problem we face right now is that in theory we can put items.yaml under DVC, but we can't even pull it in a fresh repo because dvc.yaml is not valid yet. Similarly, dvc diff doesn't work. The static nature of the DVC DAG is a limiting factor for us, but we have worked around most of the problems except this one, for which we have to rely on a separate pipeline to pull such files. I'm looking for a better solution and haven't come up with one yet.
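
One possible stopgap for the fresh-clone case, sketched here in Python, is a small bootstrap step that writes a placeholder items.yaml when the file is missing, so that the vars entry in dvc.yaml points at something that exists. This is an untested assumption rather than a documented DVC feature: `ensure_vars_file` is a hypothetical helper, and whether DVC accepts an empty mapping in `foreach`, or cleanly replaces the placeholder on a later pull, has not been verified.

```python
from pathlib import Path

# Placeholder content mirroring the inline default from the issue's dvc.yaml.
PLACEHOLDER = "items: {}\n"


def ensure_vars_file(path, placeholder=PLACEHOLDER):
    """Create the vars file with placeholder content if it is missing.

    Returns True if the file was created, False if it already existed.
    Hypothetical helper: it knows nothing about DVC itself.
    """
    target = Path(path)
    if target.exists():
        return False  # never overwrite real generated content
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(placeholder)
    return True


if __name__ == "__main__":
    created = ensure_vars_file("items.yaml")
    print("created placeholder" if created else "items.yaml already present")
```

The helper is meant to run before any DVC command in a fresh checkout. It deliberately leaves an existing file alone, so real generated content is not clobbered.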