iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0
13.71k stars 1.18k forks source link

flatten `deps` to facilitate more flexible templating #7151

Open itcarroll opened 2 years ago

itcarroll commented 2 years ago

The templating feature currently admits little flexibility for stage keys (e.g. deps) that expect lists. As a result, foreach stages cannot be given a variable number of dependencies. Causing lists to be flattened would make templating more powerful. For example, you could give extra dependencies to certain templated stages:

stages:
  build:
    foreach:
      us:
        deps:
      uk:
        deps:
        - uk_data
    do:
      cmd: myscript ${key}
      deps:
      - myscript
      - ${item.deps}

I have implemented the change at https://github.com/itcarroll/dvc/blob/feature/flattened-deps/dvc/dependency/__init__.py#L52, by (non-recursive) list flattening in dvc.dependency.loads_from. I am sure it's not the right way to do it, and probably has unintended consequences, but it demonstrates the concept (and no tests failed).

I don't know how templating responds to missing keys, but not having to include the empty list for us would be even more better.

itcarroll commented 2 years ago

In ancient discussions on foreach I found a suggestion (by @jorgeorpinel):

I wonder actually if you can just do deps: ${list} (a list from params.yaml)

This would add the same flexibility I'm interested in with the above, but it fails YAML validation, which appears to require deps to get a list before ${item} is expanded. So, some change in the validation process might be an alternate route to this feature.

alealv commented 2 years ago

Hi, I'm also interested in this feature https://discord.com/channels/485586884165107732/485596304961962003/946817686220984350

   data:
    foreach:
      librispeech:
        script: libri.sh
        links:
          - https://www.openslr.org/resources/12/dev-clean.tar.gz
          - https://www.openslr.org/resources/12/dev-other.tar.gz
          - https://www.openslr.org/resources/12/test-clean.tar.gz
          - https://www.openslr.org/resources/12/test-other.tar.gz
          - https://www.openslr.org/resources/12/train-clean-100.tar.gz
          - https://www.openslr.org/resources/12/train-clean-360.tar.gz
          - https://www.openslr.org/resources/12/train-other-500.tar.gz
      tedlium:
        script: tedlium.sh
        links:
          - https://www.openslr.org/resources/51/TEDLIUM_release-3.tgz
    do:
      cmd:
        - ./proc/${item.script} --output ${nas_data_dir}/${key} --tmp ${nas_data_dir} ${item.links}
      deps:
        - ${item.script}
        - ${item.links}
      outs:
        - data/${key}/${key}_text
        - data/${key}/${key}_utt2spk
        - data/${key}/${key}_wav.scp
        - data/${key}/${key}_waves

I have stages that have different list lengths

daavoo commented 2 years ago

So, some change in the validation process might be an alternate route to this feature.

This relates to other requests where templating is being limited by the YAML validation which occurs before the interpolation, for example https://github.com/iterative/dvc/issues/6989

dberenbaum commented 2 years ago

@daavoo Do you know if these scenarios work fine if not for YAML validation failures?

daavoo commented 2 years ago

@daavoo Do you know if these scenarios work fine if not for YAML validation failures?

For #6989 yes, for this issue additional work would be needed

dberenbaum commented 2 years ago

If all the deps are a single list (deps: ${those}, as suggested in #8171), is additional work needed?

m-gris commented 1 year ago

Big up-vote for this one 👍

It would be really, really helpful.

bezbahen0 commented 1 year ago

Is there any progress on this issue? It really is a much-needed feature.

dberenbaum commented 1 year ago

@bezbahen0 We have not made any progress yet.

@skshetry @daavoo Can we resolve the templating before doing validation? If that's not straightforward, can we skip validation errors triggered by unresolved interpolation?

daavoo commented 1 year ago

@skshetry @daavoo Can we resolve the templating before doing validation? If that's not straightforward, can we skip validation errors triggered by unresolved interpolation?

Will take a look, it has been a while 😅

JulianoLagana commented 1 year ago

This feature would greatly simplify our current pipeline specification!

skshetry commented 1 year ago

We don't have a good way to validate after templating, say, for example, to determine that deps has to be a list of strings/file-paths after the interpolation.

That's why we limited templated dvc.yaml and the resolved dvc.yaml to follow the same schema.

dberenbaum commented 1 year ago

Will take a look, it has been a while 😅

I already ended up taking a look and might open a PR for discussion.