iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

Pipeline variables from params file #3633

Closed dmpetrov closed 3 years ago

dmpetrov commented 4 years ago

With the introduction of the new multiple-stage pipeline, we will need to find a way of defining variables in the pipeline. For example, the intermediate file name cleansed.csv is used by two stages in the following pipeline and needs to be pulled out into a variable:

stages:
    process:
        cmd: "./process.bin --input data --output cleansed.csv"
        deps:
            - path: data/
        outs:
            - path: cleansed.csv

    train:
        cmd: "python train.py"
        deps:
            - path: cleansed.csv
            - path: train.py
            - path: params.yaml
              params:
                  lr: 0.042
                  layers: 8
                  classes: 4
        outs:
            - path: model.pkl
            - path: log.csv
              cache: true
            - path: summary.json

We need to solve two problems here:

  1. Define a variable in one place and reuse it from multiple places/stages.
  2. Often users prefer to read file names from config files (like in the train stage), not from the command line (like in the process stage).

We can solve both of these problems with a single abstraction - a params-file variable:

stages:
    process:
        cmd: ./process.bin
        outs:
             - path: "params.yaml:cleansed_file_name"
        ....
    train:
        cmd: "python train.py"
        deps:
             - path: "params.yaml:cleansed_file_name"

This feature is useful in the current DVC design as well. It is convenient to read file names from the params file and still define the dependency properly, e.g. dvc run -d params.yaml:input_file -o params.yaml:model.pkl.
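
For illustration, this is roughly what the second problem looks like from the script's side - a minimal, hypothetical train.py that reads the shared file name from params.yaml (the key name cleansed_file_name is an assumption based on the example above):

import yaml

# Load the shared configuration that DVC also tracks as a params file.
with open("params.yaml") as f:
    params = yaml.safe_load(f)

# The script reads the file name from config instead of the command line,
# which is why DVC needs a way to treat params.yaml:cleansed_file_name
# as a dependency/output path.
input_path = params["cleansed_file_name"]  # e.g. "cleansed.csv"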

karajan1001 commented 4 years ago

In my case, I store these global variables (vocabulary file, cleaned-data file, label-name file) in a configuration file as defaults, and the processing CLI can accept all of these variables to override the defaults. But the configuration-file parsing and merging is coded inside the program, and has to be re-coded in every program.

tall-josh commented 4 years ago

Before I knew about multiple-stage pipelines, I used the params functionality and hacked together a script to build a richer dvc run command (a sketch of such a script follows the params.yaml below). I used the same dot.notation you can use in the params command-line args, i.e. dvc run -p params.yaml:data.output,train ... In the end my params.yaml looks like:

data:
    url: s3://mydata.npz
    out_path: data/mydata.npz
data/-o:  # Each element in this list will be
          # added as a  -o <arg>
    # To access strings within the current stage
    # you can also use '.out_path'
    - data.out_path
data/script:
    - python src/cli/data_download.py
    - --data-size
    - .url  # equivalent to data.url
    - --out-path
    - .out_path # equivalent to data.out_path

train:
  lr: 0.001
  ckpts: artefacts/ckpts
train/-d:
  - data.out_path
train/-o:
  - artefacts/tflogs
train/script:
  - python train.py
  - --data
  - data.out_path
  - --ckpts
  - .ckpts
  - --lr
  - .lr
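
The helper script itself wasn't posted; purely as an illustration, here is a minimal sketch of what it might look like (the resolution rules for '.key' and 'section.key' tokens are assumptions based on the comments in the YAML above):

import shlex
import yaml

def resolve(token, params, stage):
    # ".out_path" -> params[stage]["out_path"]; "data.out_path" -> params["data"]["out_path"].
    # Anything that doesn't resolve (e.g. "--data-size") passes through unchanged.
    if not isinstance(token, str) or "." not in token:
        return str(token)
    parts = token.split(".")
    keys = [stage] + parts[1:] if token.startswith(".") else parts
    value = params
    for key in keys:
        if not (isinstance(value, dict) and key in value):
            return token
        value = value[key]
    return str(value)

def build_dvc_run(params, stage):
    # Turn the "<stage>/-d" and "<stage>/-o" lists into -d/-o flags, then
    # append the resolved "<stage>/script" tokens as the quoted command.
    cmd = ["dvc", "run", "-p", "params.yaml:" + stage]
    for flag in ("-d", "-o"):
        for entry in params.get(stage + "/" + flag, []):
            cmd += [flag, resolve(entry, params, stage)]
    script = " ".join(resolve(t, params, stage) for t in params[stage + "/script"])
    return " ".join(cmd + [shlex.quote(script)])

with open("params.yaml") as f:
    params = yaml.safe_load(f)
print(build_dvc_run(params, "data"))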

I'm not suggesting changes to the multi-stage pipeline. I guess all I'm saying is:

After Thought

After sleeping on it I thought, 'I can use jinja2 for this':

stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ args.data.size }} --out-path {{ args.data.out_path }}
    wdir: ..
    outs:
    - {{ args.data.out_path }}
  train:
    cmd: python src/cli/train.py -d {{ args.data.out_path }} --params {{ args.train.params }} --log-dir {{ args.train.logdir }} --metrics-path {{ args.train.metrics }} --ckpts {{ args.train.ckpt_dir }}
    wdir: ..
    deps:
    - {{ args.data.out_path }}
    - {{ args.train.params }}
    metrics:
    - {{ args.train.metrics }}
    outs:
    - {{ args.train.logdir }}
    - {{ args.train.ckpt_dir }}

I then load my args.yaml file and convert it into a namedtuple (so I can use the dot.notation). Bingo bango, we have our filled-in .dvc file.
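
A minimal sketch of that render step, assuming the template lives in dvc.yaml.j2 and the values in args.yaml (both file names are assumptions):

from collections import namedtuple
import yaml
from jinja2 import Template

def to_namedtuple(name, mapping):
    # Recursively convert nested dicts so values are reachable with dot notation.
    fields = {
        key: to_namedtuple(key, value) if isinstance(value, dict) else value
        for key, value in mapping.items()
    }
    return namedtuple(name, fields.keys())(**fields)

with open("args.yaml") as f:
    args = to_namedtuple("args", yaml.safe_load(f))

with open("dvc.yaml.j2") as f:
    rendered = Template(f.read()).render(args=args)

with open("dvc.yaml", "w") as f:
    f.write(rendered)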

dsuess commented 4 years ago

Unfortunately, YAML anchors can only solve part of this: You can use them to fill in repeating dependencies like this:

stages:
    process:
        outs:
             - path: &cleansed cleansed.csv
       ...

    train:
        deps:
             - path: *cleansed
        ...

However, there doesn't seem to be an easy way to also "paste" them into a full command, unless we write some custom YAML parsing.
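
To make that limitation concrete (a small demo, not from the original comment): a YAML alias is only resolved when it stands as a whole node, so inside a command string it stays a literal *:

import yaml

doc = """
out: &cleansed cleansed.csv
dep: *cleansed                          # whole-node alias: resolves fine
cmd: ./process.bin --output *cleansed   # inside a plain scalar: stays literal
"""
print(yaml.safe_load(doc))
# {'out': 'cleansed.csv', 'dep': 'cleansed.csv',
#  'cmd': './process.bin --output *cleansed'}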

dmpetrov commented 4 years ago

@dsuess can jinja2-like approach from @tall-josh potentially solve this issue?

dsuess commented 4 years ago

@dmpetrov I like the idea, but I see two drawbacks:

  1. the necessary information is split across two files
  2. allowing full jinja-templating on the configuration file might be too much

To alleviate the first, maybe we can add a special variables section to the Dvcfile, like this:

variables:
  data:
    size: 10
    out_path: ./data/
  ...

stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ args.data.size }} --out-path {{ args.data.out_path }}
    wdir: ..
    outs:
    - {{ args.data.out_path }}
  train:
     ...

and then only template the stages section with those variables. No templating would be allowed for that first section.

For the second one, we could either look into disallowing certain functionality (like conditionals or loops) in jinja2, or implement our own simple templating engine that supports only a very limited set of features. So far, variable substitution is the only feature I see as necessary.

Alternatively, we could implement our own PyYaml tag to support pasting, like this:
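
(The example didn't make it into the comment; a minimal sketch of the idea, with a made-up !var tag resolved against a variables mapping:)

import yaml

VARIABLES = {"cleansed_file_name": "cleansed.csv"}

def var_constructor(loader, node):
    # Resolve the scalar tagged with !var against the variables mapping.
    key = loader.construct_scalar(node)
    return VARIABLES[key]

yaml.SafeLoader.add_constructor("!var", var_constructor)

doc = yaml.safe_load("deps:\n  - path: !var cleansed_file_name")
print(doc)  # {'deps': [{'path': 'cleansed.csv'}]}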

karajan1001 commented 4 years ago

@dsuess, good idea, but there is a question: variables need to be shared among Dvcfiles, and many of them (for me: embedding size, vocabulary file, sequence length, etc.) must stay the same across all stages. Maybe we should make the Dvcfile inherit parameters from its ancestors.

tall-josh commented 4 years ago

There are definitely use cases for both in-file variables and external variable files. I agree with @dsuess that full templating is probably not the way to go as a proper solution. Maybe it's just a terminology thing, but @karajan1001, if you're thinking full-blown inheritance, then I think things could get messy. A simple way of importing from one or more files could be nice.

vars:
  IMPORT:
    somevars: some/vars.yaml
    morevars: more/vars.yaml

  data:
    size: 10
    out_path: ./data/
  ...

stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ vars.data.size }} --out-path {{ vars.data.out_path }}
    wdir: ..
    outs:
    - {{ vars.data.out_path }}
  train:
     ...
    outs:
    - {{ somevars.something.x }}
    - {{ morevars.someotherthing.y }}

dmpetrov commented 4 years ago

@dsuess the variables section makes total sense. I see that as a simple form of variables with no param file associated. We definitely need it and this part has a higher priority.

It looks like all of us are on the same page regarding Jinja - full support is too much.

Also, I like @tall-josh's idea of importing, which unifies nicely with the explicit vars definition:

vars:
  IMPORT:
    somevars: some/vars.yaml
    morevars: more/vars.yaml

It looks like the high-level design is ready :) and we need to agree on the syntax and then pick a templating engine.

On the syntax side, I see these options:

  1. Special pattern in strings like "$vars.somevar"
  2. jinja2-style variables: {{ vars.somevar }}
  3. PyYaml tag: !var [*vars.somevar]
  4. other options

What are your thoughts on this?

(3) seems like a more formalized way of doing (1)

PS: dvc run can use a slightly different syntax, or we can decide not to support variables at the command level - let's keep that discussion separate.

tall-josh commented 4 years ago

Hi @dmpetrov Yep, we're on the same page. I like the first syntax simply because it's concise.

dsuess commented 4 years ago

@karajan1001 Agreed. I think this change makes the most sense together with #3584

@dmpetrov If we want to share variables between files, I think @tall-josh's solution is better than just implicitly passing variables through from the dependencies. That would also cover the use case of having a single variables file that's shared throughout the repo. However, I also think that not passing them between files wouldn't be a problem if the multi-stage files from #3584 are the new standard. That way, files would scope variables nicely.

Regarding the syntax: I don't have any preference as long as the files are still valid YAML. The first one will interfere with standard shell variable substitution in the command, which might be a good thing as using non-tracked variables in the command will break reproducibility. It might also be a bad thing if there's a valid use case for having shell-variables in the command.

tall-josh commented 4 years ago

@dsuess I'm not sure multi-stage is quite at the point of being the new standard, though I think it will get there soon. Someone correct me if I'm wrong, but as of a few days ago multi-stage ( #3584 ) did not support --params :-( . Something I've been meaning to bring up in their thread. I'll do that now, I guess.

dmpetrov commented 4 years ago

@dsuess agree, defining variables from the command line is not a high-priority feature. We should have a great solution in the pipeline file first. Also, good point about env variables.

@tall-josh it would be great to hear your params ideas! You can post them on the multi-stage issue or just here.

tall-josh commented 4 years ago

@dmpetrov. No probs, I've raised the issue ( #3584 ), but it's not really an idea. It's more along the lines of keeping multi-stage consistent with the regular single-stage .dvc files, i.e.:

dvc run --params myparams.yaml:train "cat myparams.yaml"

Produces a dvc file:

md5: d978fc54ad4ee1ad65e241f0bd122482
cmd: cat myparams.yaml
deps:
- path: myparams.yaml
  params:
    train:
      epochs: 10
      lr: 0.001

Whereas using multi-stage

dvc run -n stage1 --params myparams.yaml:train "cat myparams.yaml"

Results in a Dvcfile that looks like this, plus a lock file, neither of which refers to the train key in myparams.yaml:

stages:
  stage1:
    cmd: cat myparams.yaml
    deps:
    - myparams.yaml

I've directed @skshetry to this thread too, as I think the two issues would benefit from one another.

skshetry commented 4 years ago

Hi @tall-josh. We now have params support for the pipeline file (as we are calling it at the moment). We are planning to release it soon (with a public beta this week).

You have already seen the file format, so I'll jump straight into the discussion of variables. I'd propose reading variables from the pipeline file itself rather than indirectly from the params file. The pipeline file is very readable and clean, and we'll strive to keep it so in the future.

And we can introduce a variables section in the pipeline file, which DVC can read params from. This way, there can be a clear separation between params required by your scripts and params required by dvc, so information isn't duplicated (though, for read-only purposes, you can still just add things to the pipeline file, no problem with that).

So, it should look something like:

variables:
  data:
    size: 10
    out_path: ./data/
stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ data.size }} --out-path {{ data.out_path }}
    wdir: ..
    outs:
    - {{ data.out_path }}

Note that I'm not focusing on templating formats (as that's likely going to be tied to the implementation), but we should try to make it logic-less. Providing logic will only make it complicated, and ugly.

cc @karajan1001 @dsuess

elgehelge commented 4 years ago

Seems that most of you agree that reusable YAML variables should be defined directly in the pipeline file (I do too). So maybe you should rename the issue, @dmpetrov?

tall-josh commented 4 years ago

@skshetry Amazing news! I can't wait to give it a crack!

tall-josh commented 4 years ago

@skshetry I've finally got some bandwidth to try out the params support for the pipeline file. Do you have an example somewhere?

EDIT:

Ahhh, I misunderstood your comment. If I now understand correctly: there is now --params support for multi-stage dvc files, but you're still working on templating variables.

dmpetrov commented 4 years ago

@tall-josh yes, the templating is still in progress. Right now we are busy releasing DVC 1.0 https://dvc.org/blog/dvc-3-years-and-1-0-release Parametrization is a good feature for the next step, DVC 1.1 I think.

@elgehelge I think there is no need to rename this issue. It was created specifically to find a solution for using params.yaml in the pipeline. It is a useful approach when you need to localize all the configuration in a single file - a very common pattern. Also, it might be useful when some external tool can properly read and modify a simple params.yaml but does not understand the more complicated format of dvc.yaml. And we already have a lot of good ideas on how to implement it.

But I agree that "static" params are a super useful tool and should be the first priority (I have no doubt about that). I can create a separate issue for the "static" parameters case.

I like how simple variables can be implemented:

variables:
  data:
    size: 10
    out_path: ./data/

Why don't we use the same approach for importing all the params:

params_variables: [params.yaml]

stages:
  mytrain:
    cmd: python train.py --batch-size {{ train.batch_size }} --out-path {{ train.out_path }}
    outs:
        - {{ train.out_path }}

As an option, we can import by param name or param section name:

params_variables:
    - params.yaml: [process.threshold, train]

In this case, dvc.lock should include all the params that were used in this stage, like:

    params:
         train.batch_size: 2048
         train.out_path: "out/model.p"

What do you think folks?

tall-josh commented 4 years ago

@dmpetrov Thanks for the clarification. In the meantime, I had a crack at implementing a vars section at the top of the multistage dvc.yaml file using Jinja2. It's a bit janky, but it appears to have the desired effect.

https://github.com/tall-josh/dvc/tree/master/TEMPLATING_DEMO

The dvc.yaml file looks like:

vars:
    data: data.txt
    base_dir: base_dir
    stage1_out: stage1_output.txt
    stage2_out: stage2_output.txt
    splits:
        train: 0.89
        eval: 0.3

stages:
  st1:
    deps:
    - {{ vars.data }}
    outs:
    - {{ vars.base_dir }}/{{ vars.stage1_out }}
    cmd: >-
        mkdir {{ vars.base_dir }};
        cat {{ vars.data }} > {{ vars.base_dir }}/{{ vars.stage1_out }}
  st2:
    deps:
    - {{ vars.base_dir }}/{{ vars.stage1_out }}
    outs:
    - {{ vars.base_dir }}/{{ vars.stage2_out }}
    cmd: >-
        echo "train: {{ vars.splits['train'] }}" > {{ vars.base_dir }}/{{ vars.stage2_out }};
        cat {{ vars.base_dir }}/{{ vars.stage1_out }} >>
        {{ vars.base_dir }}/{{ vars.stage2_out }}

jcpsantiago commented 4 years ago

Hello everyone! I'm a bit late to the party, but here are some ideas about using variables for looping (original discussion started in https://discuss.dvc.org/t/using-dvc-to-keep-track-of-multiple-model-variants/471).

In a nutshell, I want to use the same pipeline to train models using different groups or sets of data. The steps are the same, but the data is slightly different and the outputs have different paths.

Having vars or a params.yaml file would be great! I like the idea of declaring how the DAG should look and what parameters it needs to run.

From the original post:

params:
  group1:
    name: "group 1"
    something_else: "foo"
  group2:
    name: "group 2"
    something_else: "meh"
  group3:
    name: "group 3"
    something_else: "gwah"

stages:
  get_some_data:
    cmd: ./get_data.sh {{params.name}}
    deps:
      - get_data.sh
      - query.sql
    outs:
      - data.csv
  train_model:
    cmd: Rscript train.R {{params.name}} {{params.something_else}}
    deps:
      - data.csv
    outs:
      - models/{{params.name}}/model.xgb
    metrics:
      - metrics/{{params.name}}/model_eval.json

No ideas at the moment regarding how to declare that some parameters should be looped on.

I imagine dvc repro would then run the whole DAG group-by-group from start to finish (as opposed to stage-by-stage).

tall-josh commented 4 years ago

I've been mulling this over too. Would be super handy!!

skshetry commented 4 years ago

This only focuses on parametrization and possible solutions that were explored.

Jinja (as a file templating engine)

Similar to @tall-josh's implementation on #4463.

  1. Considerably faster.
  2. We need to decode primitives again or YAML encode those values again (for an example, see Saltstack).
  3. Better if we follow jinja-based syntax (i.e. provide filters and context around it). For example, look into saltstack's filters or Ansible's.
  4. We won't be able to do dynamic substitution (eg: build-matrix and for-each loops). See: #331, #1018, #1462 and #4213.
  5. Also, we won't be able to know what values were substituted so that we could auto-track them (see: #4525).
  6. The default syntax is {{ var }}, which is error-prone in YAML because { } is valid flow-mapping (JSON) syntax, so values must be enclosed in " " (quotes). But it supports custom syntax too.

Jinja (substituting each string recursively)

  1. Helps us to implement dynamic substitution (eg: build-matrix and for-each loops).
  2. Is slower, as it needs to substitute many times (once for each string).
  3. Could not find a way to disable blocks, eg: if conditionals and for constructs. We only need substitution for now.
  4. We won't be able to auto-track them. (see: #4525)

OmegaConf

OmegaConf is a YAML-based hierarchical configuration system, i.e. it can load config from multiple files and merge them together. It also supports interpolation by referencing other parameters, but only within the config.

  1. Interpolation is done with a ${}-based syntax, which I personally prefer.
  2. It won't be that useful to us, as the interpolation is "node"-based, within one merged config. I have to say, we'd implement more than half of what OmegaConf provides (eg: importing configs, merging imports, interpolation and overriding values from CLI).
  3. A good thing about this lib is that it provides a good architecture and inspiration for DVC internals.

ConfigParser.ExtendedInterpolation and friends

Suited for ini formats. Syntax is ${section:key}. Looked into Thinc's, which extends it to support dot-based syntax.

Github Actions

I was unable to find how Github Actions parses these expressions. We do have a tool called act that allows us to run Github Actions locally; it is written in Go, and it seems to use a JavaScript interpreter to evaluate those expressions.

Similarly, another tool (written in JS) parses those expressions into a javascript AST.

str.format, string.Template and friends

str.format uses { var } syntax, and it requires us to "sandbox" it properly. Still, I hate the syntax. string.Template supports $var and ${var} syntax, but does not support accessing attributes (eg: ${x.y}).
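
(To illustrate the string.Template limitation - a quick demo, not from the original comment:)

from string import Template

print(Template("lr is ${lr}").substitute({"lr": 0.042}))  # works: "lr is 0.042"

# Dotted access is not a valid placeholder, so this raises
# ValueError: Invalid placeholder in string.
Template("path is ${train.out_path}").substitute({"train": {"out_path": "out"}})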


I didn't find any good "readymade" libraries (or a combination thereof) that could:

  1. Support expression interpolation, including lists, dicts, and decoding primitive values (to int, float, string, bool, etc.).
  2. Have a good syntax that's easy on the eyes and easily recognizable as "expand" syntax.
  3. Support "import"-ing parameters and variables from different files, and merge them properly.
  4. Provide us information on what values were interpolated/used.
  5. Be fast.

So, I propose we roll our own custom interpolation with custom syntax. /cc @iterative/engineering

The proposal will soon follow.

skshetry commented 4 years ago

Note: I have to point out that ${} interpolation is the proposal; the rest is part of the discussion that'll eventually be implemented on top of interpolation.

I propose the syntax ${ var } for interpolation, as many users will already be well familiar with it, and it does not have the problem that { var } has in YAML.

We have to handle the conversions (to list/dict/bool/string/int/floats et cetera) ourselves as well.
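
As a rough illustration of what "handling the conversions ourselves" could mean (a sketch, not DVC's actual implementation):

import re

PATTERN = re.compile(r"\$\{\s*([\w.]+)\s*\}")

def lookup(context, dotted):
    # Walk "train.out_path" through nested dicts.
    value = context
    for part in dotted.split("."):
        value = value[part]
    return value

def interpolate(text, context):
    # If the whole string is a single placeholder, return the raw value
    # (preserving int/float/bool/list/dict); otherwise substitute as strings.
    whole = PATTERN.fullmatch(text.strip())
    if whole:
        return lookup(context, whole.group(1))
    return PATTERN.sub(lambda m: str(lookup(context, m.group(1))), text)

ctx = {"train": {"batch_size": 2048, "out_path": "out/model.p"}}
print(interpolate("python train.py --batch-size ${train.batch_size}", ctx))
print(interpolate("${ train.batch_size }", ctx))  # -> 2048 (int preserved)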

Open Question

  1. How to append to a list? Or exclude certain elements from it?

Regarding the imports of the variables, let me just show the example dvc.yaml:

args:
  imports:
   - params.yaml
   - params.toml
   - params.json:Train
   # all of these are merged together (note: only dicts are merged).
  constants:
    # these are also merged together
    prepare:
      script: src/train.py

stages:
  train:
    cmd: python ${prepare.script} ${lr} ${layers}
    deps:
    - data/data.xml
    - ${prepare.script}
    params:
    - lr
    - layers
    outs:
    - data/prepared

Let's say params.yaml has the following values:

lr: 0.042
layers: 8
classes: 4

You can see that there's duplication between the params dependency and the arguments in the cmd section. So, as per #4525, it'd be nice to remove this duplication and make dvc understand them automatically.

This is still under discussion, but @dmpetrov suggested using the ${{ var }} syntax, through which DVC would auto-track variables (there will be entries in dvc.lock; they just won't appear in the params section of dvc.yaml). I quite like this syntax, although I have a few alternative syntaxes:

  1. ${} for expansion, and $() for auto-track and expansion.
  2. ${ } for expansion, and ${track: prepare.seed} for auto-track and expansion.
  3. ${} for expansion, and ${ track(prepare.seed) } for auto-track and expansion (similar to Actions).

Please do suggest if you have a good syntax. Also note that, during a discussion w/ @dmpetrov, we decided not to solve #4525, which proposes to auto-track but not expand variables, and instead do both of those together.


Why do we need to support "interpolation" to lists/dicts?

This is something that will be useful for foreach and/or build-matrix, which will need to interpolate values from imported params (if it's not already useful for params substitution).

This is very early in the discussion, but I'll post what @dmpetrov suggested:

args:
  ...
stages:
  foreach: ${ outputs }
    prepare:
      cmd: python prepare.py ${ item }
      deps:
      - raw/${ item }
      outs:
      - processed/${ item }

This will support iterating over dicts, and item or something similar might be a "reserved" keyword inside the loop; for the stage name, we could append the key to its declared name (i.e. prepare-file1.txt and prepare-file2.txt).
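
(A toy sketch of how that expansion could work - purely illustrative, not DVC's code:)

def render(value, item):
    # Substitute "${ item }" in strings, recursing into lists.
    if isinstance(value, str):
        return value.replace("${ item }", str(item))
    if isinstance(value, list):
        return [render(v, item) for v in value]
    return value

def expand_foreach(name, do, items):
    # One concrete stage per item; the key is appended to the declared
    # stage name, giving prepare-file1.txt, prepare-file2.txt, ...
    return {name + "-" + str(item): {k: render(v, item) for k, v in do.items()}
            for item in items}

do = {
    "cmd": "python prepare.py ${ item }",
    "deps": ["raw/${ item }"],
    "outs": ["processed/${ item }"],
}
print(expand_foreach("prepare", do, ["file1.txt", "file2.txt"]))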

You might be able to provide values directly as well. Something that I have in mind:

foreach:
  values:
    x: 1
    y: 2
  stage-name:
    ...

Though, if I look at foreach through this use case, I think it should be just inside the stage rather than one indentation level outside, as in the previous example.

There are a few open questions:

  1. Should we support nested interpolation?
    cmd: python script.py ${i} ${{ train${i} }}
  2. Do we require local "variable" of some sort, accessible inside each iteration of the loop? It could be something that user could set:
    set:
      data: ${prepare.data}  # would be accessible as ${ data }
    ...
  3. How to reference the top-level of "args", the whole merged value to loop through in foreach? Do we need a special name/keyword for it?
  4. Do we need to support custom include/exclude of the foreach values?
  5. Also, we might need to support some range-like cases, roughly following (as suggested by @dmpetrov):
    foreach "i = range(0, 5)":
      prepare:
            cmd: python script.py
            params:
            - train${i}

Some of the inspirations will for sure be taken from Ansible and pypyr.

Again, we just discussed this, so it's very early; take this comment on foreach with a grain of salt. I just wanted to share what we've discussed so far.


Did I tell you that the config can be overridden from the CLI? E.g.:

$ dvc repro --params prepare.seed=3

dmpetrov commented 4 years ago

@skshetry thank you for the deep investigation and all the insights you found 💎

It feels like we are getting closer to the final design. I just want to share a few comments and my personal preferences.


args:
  imports:
   - params.yaml
  constants:

args is a bit too specific - variables are used more widely than just args in cmd. I'd prefer vars or params over args. I'd prefer separate params (params-file-level abstraction) and vars (pipeline-level abstraction). That way, we can think about using vars for dvc.yaml in all the places, like dvc repro --vars prepare.seed=3.


  • ${} for expansion, and $() for auto-track and expansion.
  • ${ } for expansion, and ${track: prepare.seed} for auto-track and expansion.
  • ${} for expansion, and ${ track(prepare.seed) } for auto-track and expansion (similar to Actions).

The first option might look confusing for Unix users, since ${} and $() already carry established (and different) meanings in Unix shells. All the other options look good to me. My personal preference is ${} + ${{}} or ${} + ${param: prepare.seed} (track: is ok).


  foreach: ${ outputs }
    prepare:
      cmd: python prepare.py ${ item }

We might need some option to override item, like:

   foreach: ${ outputs }
   item: input_file
   prepare:
       cmd: python prepare.py ${input_file}

  • Do we require local "variable" of some sort, accessible inside each iteration of the loop? It could be something that user could set:
    set:
      data: ${prepare.data}  # would be accessible as ${ data }

It is syntactic sugar. Not a priority at all.


  • How to reference the top-level of "args", the whole merged value to loop through in foreach? Do we need a special name/keyword for it?

I'm not sure I understand this.


  • Do we need to support custom include/exclude of the foreach values?

It seems a bit too advanced. We should expect a very limited list of values, like a build matrix in CI systems.


  • Also, we might need to support some range-like cases, roughly following (as suggested by @dmpetrov):
    foreach "i = range(0, 5)":
    prepare:
          cmd: python script.py
          params:
          - train${i}

It might look better as a separate option:

  foreach:
    range:
      var: i
      from: 0
      to: 5
    prepare:
      cmd: python script.py
      params:
      - train${i}

A couple more open questions:

tall-josh commented 4 years ago

Looks like I have some catching up to do!

skshetry commented 4 years ago

Looks like I have some catching up to do!

:slightly_smiling_face: @tall-josh, I was reviewing your pull request. Thank you so much for helping this move forward.

shcheklein commented 4 years ago

@skshetry awesome research, a few questions:

1. Why do we need imports? At least initially. Keep the same name resolving mechanism as we use for parameters. We won't need to merge stuff, we can simplify constants section significantly.

2. Can we avoid creating a separate syntax to track something?

E.g. by treating external files as params, and the constants section as ... well, constants that serve templating purposes. We kinda force users to define tracked and non-tracked things in different places.

But to be honest, I think that duplicating the tracked section explicitly is better than introducing "magic" and some special syntax - that will make the implementation harder and the learning curve steeper.

3. Loops - I need to wrap my mind around it. First, what task are we solving with this? It looks like we solved the case where we generate multiple stages. Does it solve the case where we need to iterate on multiple items within a single stage? Should we prioritize this, and keep pipeline construction somewhere outside of dvc.yaml?

skshetry commented 4 years ago

1. Why do we need imports? At least initially. Keep the same name resolving mechanism as we use for parameters. We won't need to merge stuff, we can simplify constants section significantly.

It seems that there are two needs for parametrization (there might be more). One is simple loop/build-matrix scenarios that have nothing to do with parameters; another is building different datasets/models based on different parameters in params.yaml. So these are not separate things; one or the other could be used for parametrization/loops.

One example that @dmpetrov provided was building models with different sets of params kept as:

us:
  filename: something
ch:
  filename: something-something

And using these to build 2 different stages/models with different sets of parameters (those inside their respective keys).

2. Can we avoid creating a separate syntax to track something?

Yes, I'd love to do that, but as I said above, they seem to be kind of interconnected. If we create a hard line between params and constants, then this might work.

First, what task are we solving with loops?

I see it mostly as a generic build-matrix. But it looks like we need different syntax/semantics based on how people try to use it and how they keep their parameters in params.yaml (simple loops, nested loops/build-matrix, iterating just over the keys, etc.).

It looks like we solved the case where we generate multiple stages. Does it solve the case when we need to iterate on multiple items within a single stage?

I did not quite get these questions/statements.

Should we prioritize this and keep pipeline-construction somewhere outside of dvc.yaml?

"Outside of the dvc.yaml" - could you please elaborate on this?

johnnychen94 commented 3 years ago

The biggest improvement of the parametrization feature I've experienced is that I can now scale the workflow by only modifying params.yaml:

This defines the minimal repro workflow (for simplicity, I've removed a lot of params here); I use this for my GitLab CI setting:

# params.yaml

models:
  gaussian_15:
    name: gaussian_15
    data:
      noise_level: 15
    train:
      epochs: 2

datasets:
  train:
    name: train
    src: data/DnCNN.zip
    path: data/train/rawdata
  Set12:
    name: Set12
    src: data/DnCNN.zip
    path: data/test/Set12

testcases:
  set12_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set12

In this setting, I train one model and test the model on one dataset (Set12).

The corresponding DAG: [image omitted]

I can easily repeat some of the contents in params.yaml. In this case, I now train three models and test each model on two datasets (Set12 and Set68).

# params.yaml
models:
  gaussian_15:
    name: gaussian_15
    data:
      noise_level: 15
    train:
      epochs: 50

  gaussian_25:
    name: gaussian_25
    data:
      noise_level: 25
    train:
      epochs: 50

  gaussian_50:
    name: gaussian_50
    data:
      noise_level: 50
    train:
      epochs: 50

datasets:
  train:
    name: train
    src: data/DnCNN.zip
    path: data/train/rawdata
  Set12:
    name: Set12
    src: data/DnCNN.zip
    path: data/test/Set12
  Set68:
    name: Set68
    src: data/DnCNN.zip
    path: data/test/Set68

testcases:
  set12_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set12
  set12_gaussian_25:
    model_name: gaussian_25
    noise_level: 25
    dataset: Set12
  set12_gaussian_50:
    model_name: gaussian_50
    noise_level: 50
    dataset: Set12

  set68_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set68
  set68_gaussian_25:
    model_name: gaussian_25
    noise_level: 25
    dataset: Set68
  set68_gaussian_50:
    model_name: gaussian_50
    noise_level: 50
    dataset: Set68

:tada: the workflow gets scaled automatically without touching any of the other files:

[image omitted: the scaled-up DAG]

It's still unclear to me what the best practice is for parametrizing the workflow, but I'm pretty satisfied with how clean it makes the code.

I'm still playing with dvc and the parametrization. When I feel ready, I might sum it all up in a blog post.

dvc.yaml:

stages:
  extract_data:
    foreach: ${datasets}
    do:
      cmd: mkdir -p ${item.path} && unzip -q ${item.src} "${item.name}/*" -d tmp && mv tmp/${item.name}/* ${item.path}
      deps:
      - ${item.src}
      outs:
      - ${item.path}
  prepare:
    foreach: ${models}
    do:
      cmd: >-
        julia --color=yes --startup=no --project=prepare -e "using Pkg; Pkg.instantiate()" &&
        julia --color=yes --startup=no --project=prepare \
        prepare/main.jl data/train/rawdata/ data/train/prepared \
        --noise-level ${item.data.noise_level} \
        --patch-stride ${item.data.patch_stride} \
        --patch-size ${item.data.patch_size} \
        --max-num-patches ${item.data.max_num_patches}
      deps:
      - data/train/rawdata
      - prepare
      outs:
      - data/train/prepared/${item.name}.h5
  train:
    foreach: ${models}
    do:
      cmd: >-
        julia --color=yes --startup=no --project=train -e "using Pkg; Pkg.instantiate()" &&
        julia --color=yes --startup=no --project=train \
        train/main.jl data/train/prepared/${item.name}.h5 models/${item.name} \
        --epochs ${item.train.epochs} \
        --batch-size ${item.train.batch_size} \
        --initial-lr ${item.train.initial_learning_rate} \
        --lr-decay-ratio ${item.train.learning_rate_decay_ratio} \
        --lr-decay-freq ${item.train.learning_rate_decay_frequency} \
        --model-width ${item.train.model_width} \
        --model-depth ${item.train.model_depth} \
        --padding-size ${item.train.padding_size} \
        --kernel-size ${item.train.kernel_size} \
        --batch-size ${item.train.batch_size} \
        --log-freq ${item.train.log_frequency} \
        --use-gpu ${item.train.use_gpu}
      deps:
      - data/train/prepared/${item.name}.h5
      - train
      outs:
      - models/${item.name}
  evaluate:
    foreach: ${testcases}
    do:
      cmd: >-
        julia --color=yes --startup=no --project=evaluate -e "using Pkg; Pkg.instantiate()" &&
        julia --color=yes --startup=no --project=evaluate evaluate/main.jl \
        models/${item.model_name} \
        data/test/${item.dataset} \
        results/${item.model_name}/${item.dataset} \
        ${item.noise_level}
      deps:
      - data/test/${item.dataset}
      - models/${item.model_name}
      - evaluate
      outs:
      - results/${item.model_name}/${item.dataset}/processed_images
      plots:
      - results/${item.model_name}/${item.dataset}/checkpoints.csv
      - results/${item.model_name}/${item.dataset}/metrics.csv
  summary:
    cmd: >-
      julia --color=yes --startup=no --project=evaluate -e "using Pkg; Pkg.instantiate()" &&
      julia --color=yes --startup=no --project=evaluate evaluate/summary.jl results metrics.json
    deps:
    - results
    - evaluate
    metrics:
    - metrics.json

Even simpler params.yaml

The params.yaml can be further simplified if dvc borrows Travis's build-matrix semantics, e.g.:

stages:
  build:
    foreach-matrix:
          - ${model}
          - ${dataset}
    do:
      ...

or Ansible filter semantics, e.g.:

stages:
  build:
    # item[0] := ${model}
    # item[1] := ${dataset}
    foreach: ${model} | zip(${dataset})
    do:
      ...

I don't have a preference here, I like both specifications.

For normal data-science projects, it might be better to adopt the Travis way because data scientists can adapt to it more easily; they normally are not good programmers :fearful: I'm not sure of that, though.


running concurrent jobs

As the workflow can now scale up very easily, it becomes a pain to queue all jobs into one sequential dvc repro. I still don't have a good sense of how this can be done programmatically, but having a loose read lock (#4979) sounds like a good start.

~I haven't followed all the discussions in #755 yet~; it can be quite challenging to let DVC figure out the best parallelization strategy. So my intuition is that, with the current dvc repro -s functionality, it might be easier to put this into another YAML file:

Edit: This seems to be option 1 that @dmpetrov mentioned in https://github.com/iterative/dvc/issues/755#issuecomment-561031849

# jobs.yaml
jobs:
  - name: extract_data@*
    limit: 1 # 1 means no concurrency, this can be set as default value
  - name: prepare@*
    limit: 4 # at most 4 concurrent jobs
    env:
      JULIA_NUM_THREADS: 8
  - name: train@*
    limit: 4
    env:
      # if it's a scalar, apply to all jobs
      JULIA_NUM_THREADS: 8
      # if it's a list, iterate one by one
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
        - 3
  - name: evaluate@*
    limit: 3
    env:
      JULIA_NUM_THREADS: 8
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
  - name: summary

extract_data@* follows the same glob syntax as in #4976.

P.S. I don't have access to a Slurm cluster, so my ideas on parallelization basically assume a local machine. It's not very clear to me whether distributed computation should be handled by dvc or by the concrete languages/toolboxes that dvc calls.

skshetry commented 3 years ago

Hi, everyone. The parametrization feature is complete, at least the things that we planned for the next release. We are hoping to publish a pre-release version this week, so it'd be really helpful if you guys could provide feedback.

tall-josh commented 3 years ago

Are there docs or examples for this? Does it follow @johnnychen94's latest comment? If there are no docs but I can get a nudge in the right direction, I could put together a notebook or something.

tall-josh commented 3 years ago

Ahhh, I think I found what I'm looking for in tests/func/test_run_multistage.py:test_run_params_default()

skshetry commented 3 years ago

Are there docs or examples for this?

@tall-josh, it's in the wiki as mentioned here: https://github.com/iterative/dvc/wiki/Parametrization

@johnnychen94's example also works.

Ahhh think I found what I'm looking for in tests/func/test_run_multistage.py:test_run_params_default()

What are you looking for exactly? I don't think it has anything to do with parameterization.

tall-josh commented 3 years ago

Thanks @skshetry, that wiki is perfect. Yeah, test_run_params_default() did not give me what I want.