iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

configurable default params.yaml (or templating entire pipelines) #7939

Open tibor-mach opened 2 years ago

tibor-mach commented 2 years ago

Hi, I have a setup where I use a single pipeline (with several stages) for training multiple models which are almost the same, but use different training data and parameters.

I currently have a copy of the dvc.yaml pipeline in a folder with the respective params.yaml file for each model. It looks more or less like this:

```yaml
stages:
  train_test_split:
    wdir: ../../../..
    cmd: >-
      python modules/regression/train_test_split.py
      --params=${paths.params_file}
    deps: ...
    outs: ...
    params:
      - ${paths.params_file}: # needs to be set due to a different working directory
          - paths.data_all
          - train_test_split
  assemble_model: ...
  optimize_hyperparams: ...
  fit_model: ...
  evaluate: ...
```

This works (I then always run `dvc repro -P`), but I have to copy the pipeline file, which makes versioning difficult. The only part that is not templated (since it cannot be, AFAIK) is the default params file.

I would love to have a dvc.yaml file in the root folder of my project which can be run with several different params.yaml files from several locations, kind of like `foreach ... do` but at the level of the entire pipeline.

Also, I believe I have to explicitly add the path to the params file under the params keyword when running the stage from a different working directory... Not sure if that is a bug or a feature :-)

Thanks a lot!

P.S.: I tried a similar setup with templating all the stages, but there are limitations in the way templating and foreach work right now, and I also feel this would be a more elegant way to do it. The pipelines and the overall architecture are the same; what differs are the training data and (some) parameters. So an option like "for each params file in a list, reproduce a separate instance of the pipeline" would make a lot of sense to me (it would then make sense to have separate lock files as well).
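A purely hypothetical sketch of that idea (this syntax does not exist in DVC today; file paths are illustrative):

```yaml
# NOT valid DVC syntax, just a sketch of the proposal
foreach:
  - modules/regression/params.yaml
  - modules/classification/params.yaml
do:
  params_file: ${item}   # each iteration runs the whole pipeline against its
  stages: ...            # own params file and, ideally, its own dvc.lock
```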

pmrowla commented 2 years ago

> Also, I believe I have to explicitly add the path to the params file under the params keyword when running the stage from a different working directory... Not sure if that is a bug or a feature :-)

The params path is interpreted as relative to wdir, the same way as deps and outs. So if you don't specify a params path, DVC looks for params.yaml in wdir.
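For example (paths illustrative, not from this issue), with `wdir` pointing two levels up, both the dependency and the params file resolve from there:

```yaml
stages:
  train:
    wdir: ../..
    cmd: python train.py --params=configs/model_a/params.yaml
    deps:
      - configs/model_a/data.csv       # resolved relative to wdir
    params:
      - configs/model_a/params.yaml:   # explicit path, also relative to wdir
          - train
    # without the explicit path above, DVC would track ../../params.yaml
```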

tibor-mach commented 2 years ago

> > Also, I believe I have to explicitly add the path to the params file under the params keyword when running the stage from a different working directory... Not sure if that is a bug or a feature :-)
>
> The params path is interpreted as relative to wdir, the same way as deps and outs. So if you don't specify a params path, DVC looks for params.yaml in wdir.

yeah, that makes sense. thanks.

Is it then possible to specify the params path at the level of the entire pipeline? I could then write a simple loop in a shell script to go through all the different params files and call dvc repro on each one. It would still be nicer to do this explicitly in the dvc.yaml, but this would also work.
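A sketch of such a shell loop, assuming each model folder under `modules/` carries its own copy of dvc.yaml and params.yaml (the `DVC_CMD` override is a hypothetical convenience so the loop can be dry-run without dvc installed):

```shell
# Reproduce the (copied) pipeline once per model folder.
run_all() {
  DVC_CMD="${DVC_CMD:-dvc}"            # set DVC_CMD=echo to dry-run the loop
  for dir in modules/*/; do
    [ -f "${dir}dvc.yaml" ] || continue
    ( cd "$dir" && $DVC_CMD repro -P )
  done
}
```

Calling `run_all` then reproduces every pipeline copy in turn.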

pmrowla commented 2 years ago

> Is it then possible to specify the params path at the level of the entire pipeline? I could then write a simple loop in a shell script to go through all the different params files and call dvc repro on each one. It would still be nicer to do this explicitly in the dvc.yaml, but this would also work.

No, not at the moment. Doing this via templating the way you have it set up now is probably still the best way to accomplish it.

tibor-mach commented 2 years ago

Is that something I could perhaps help with (in case this is a feature you'd like to include)? I am not very familiar with the inner workings of dvc at this level of detail, but this (a configurable default params file) does not sound particularly complicated, and it would definitely help me a lot, so I'd love to help implement it.

tibor-mach commented 2 years ago

Actually, I still don't quite get how the params path works. The thing is that I am referencing the path to the params file in the dvc.yaml like this:

```yaml
  train_test_split:
    wdir: ../../../..
    cmd: >-
      python modules/estimation/train_test_split.py
      --params=${paths.params_file}
    deps: ...
    outs: ...
    params:
      - ${paths.params_file}: # needs to be set due to a different working directory
          - paths.data_all
          - train_test_split
```

I understand that dvc first looks for the params.yaml file in the working directory. But how does it actually find it otherwise? I set the path via templating in params like this:

```yaml
    params:
      - ${paths.params_file}: # needs to be set due to a different working directory
```

but the paths key is itself part of the params.yaml file, so the reference is kind of circular... dvc somehow has to know where to look for the params.yaml, otherwise it could not resolve the templating reference. But if it does, why is it necessary to mention the path explicitly? I mean, it works, but it seems a bit strange to me.

daavoo commented 2 years ago

@tibor-mach This answer might help you understand the differences between params and templating resolving: https://github.com/iterative/dvc/issues/7316#issuecomment-1027703686
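My understanding of the linked explanation, with illustrative paths: `${...}` templating is resolved when dvc.yaml is parsed, using the params.yaml that sits next to dvc.yaml (or explicit `vars`), while entries under `params:` are tracked dependencies resolved relative to the stage's `wdir`. The two lookups can therefore point at different files, which is why the reference only looks circular:

```yaml
# params.yaml next to dvc.yaml supplies the template variables, e.g.:
#   paths:
#     params_file: modules/regression/params.yaml
stages:
  train_test_split:
    wdir: ../../../..
    cmd: python modules/regression/train_test_split.py --params=${paths.params_file}
    params:
      - ${paths.params_file}:   # expanded at parse time, then tracked relative to wdir
          - train_test_split
```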

daavoo commented 2 years ago

> Is that something I could perhaps help with (in case this is a feature you'd like to include)? I am not very familiar with the inner workings of dvc at this level of detail, but this (a configurable default params file) does not sound particularly complicated, and it would definitely help me a lot, so I'd love to help implement it.

I think configuring a default params file could be a good simple feature to add.

The default path is defined here:

https://github.com/iterative/dvc/blob/af649af46276b662b4fa03fd6ab63c36521f28aa/dvc/dependency/param.py#L38

And (I hope) that it is the single source of truth.

If you would like to make it configurable, you would need to first add a new config option (https://github.com/iterative/dvc/blob/af649af46276b662b4fa03fd6ab63c36521f28aa/dvc/config.py).

The way I would do it is by updating the `DEFAULT_PARAMS_FILE` behavior, probably converting it to a `@property` that checks whether the config option is set and otherwise returns `params.yaml`.
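A minimal sketch of that suggestion; `ParamsDependency`, `repo_config`, and the `"params_file"` config key are stand-ins here, not DVC's actual internals:

```python
DEFAULT_PARAMS_FILE = "params.yaml"


class ParamsDependency:
    def __init__(self, repo_config=None):
        # repo_config stands in for the parsed DVC config (see dvc/config.py)
        self._config = repo_config or {}

    @property
    def default_params_file(self):
        # return the configured override if set, otherwise fall back to the default
        return self._config.get("params_file", DEFAULT_PARAMS_FILE)
```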

tibor-mach commented 2 years ago

> @tibor-mach This answer might help you understand the differences between params and templating resolving: #7316 (comment)

I see, that setup is a bit counterintuitive to me, but I guess I understand the behaviour better now :-)

> > Is that something I could perhaps help with (in case this is a feature you'd like to include)? I am not very familiar with the inner workings of dvc at this level of detail, but this (a configurable default params file) does not sound particularly complicated, and it would definitely help me a lot, so I'd love to help implement it.
>
> I think configuring a default params file could be a good simple feature to add.
>
> The default path is defined here:
>
> https://github.com/iterative/dvc/blob/af649af46276b662b4fa03fd6ab63c36521f28aa/dvc/dependency/param.py#L38
>
> And (I hope) that it is the single source of truth.
>
> If you would like to make it configurable, you would need to first add a new config option (https://github.com/iterative/dvc/blob/af649af46276b662b4fa03fd6ab63c36521f28aa/dvc/config.py).
>
> The way I would do it is by updating the `DEFAULT_PARAMS_FILE` behavior, probably converting it to a `@property` that checks whether the config option is set and otherwise returns `params.yaml`.

Cool, seems simple enough. I'll have a look at it, thanks!

tibor-mach commented 2 years ago

@daavoo Just one more thing...

How is this going to work with dvc.lock? Am I going to get a single huge dvc.lock file (say, for 10 pipeline runs, each with a different params.yaml), or one dvc.lock per pipeline (the latter would be desired, at least by me)?

simplymathematics commented 8 months ago

I hacked something like this together using a hydra conf/ folder to first parse a set of experiment parameters with hydra's syntax (like the dvc-supported hydra mode), but the dvc exp command is exceptionally slow when running a large number of experiments. Instead of tracking outputs from individual experiments, I just write them to a given folder and track the whole folder. It's a quick hack without a strict one-to-one correspondence between inputs and outputs, but I've found it really useful for large hyper-parameter searches across multi-stage pipelines, especially with long-running models and search spaces that benefit from caching. I suppose it wouldn't be hard to implement something like this in dvc: instead of explicit file-space configuration like foreach or matrix stages, just use the md5 hash of a set of parameters.

Let's assume you need to do some multi-objective search using hydra, where you have a set number of trials and a parameter space, but the search space is large enough that naming each dataset, model, metric, and plot manually becomes burdensome. Say 1000 trials across 5 hyper-parameters that could be categorical, ints, floats, ranges, distributions, etc.

It shouldn't be too difficult to add syntax to support this kind of reproducible search, for example:

```yaml
# hypothetical syntax, not supported by dvc today
search:
  cmd: python example/script.py --multirun ${hydra.sweeper.params} # normal optuna hydra params
  outs:
    - ${hydra.sweeper.storage}: # normal hydra parameter
        persist: true # e.g. to change the set of random states without deleting older results
    - ${hydra.sweep.dir} # normal hydra parameter
    - ${data.item}.json # finds the md5 hash of the data param dictionary
    - ${model.item}.json # finds the md5 hash of the model param dictionary
  metrics:
    - results/${item}.json # finds the md5 hash of all the parameters
  params: # this would have to point to a hydra conf folder instead of the normal params.yaml
    - conf/default.yaml: # uses the normal hydra configuration folder
        - data
        - model
```
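The md5-of-the-params idea can be sketched in a few lines (function names hypothetical; dumping to canonical JSON keeps the hash stable under key reordering):

```python
import hashlib
import json


def params_hash(params: dict) -> str:
    # canonical dump: sorted keys, no whitespace, so equal dicts hash equally
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def output_path(params: dict, prefix: str = "results") -> str:
    # deterministic output location derived from the parameter set
    return f"{prefix}/{params_hash(params)}.json"
```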

In this way, you can define an arbitrary search with tracked inputs and outputs without having to name the outputs explicitly. The one downside of this particular syntax is that you'd need to create a database (e.g. the hydra.sweeper.storage) rather than exploiting the database format that optuna generates, meaning you'd need a new database for each output rather than a table, which hydra offers via the ${hydra.sweeper.study_name} syntax in the hydra-optuna plugin. You could run a separate optuna-dashboard server on that database and then query a URL, but dvc's support for this kind of parallel search is limited to grid search and explicit file configurations.

The flexibility of the full hydra launcher syntax (supporting distributed queues, multi-objective search, fine-grained joblib configuration, etc.) is far preferable to the limitations of dvc exp in general, so I think extending the foreach/matrix idea to implicit file-space configuration would be worthwhile.

In this way, you could test a set of model configurations across several reproducible sets of samples without giant paths like results/model.layers=4/model.channels=3/model.epochs=20/model.output=logits/data.random_state/data.preprocessor/data.sampler/, getting `results/<hash>.json` or `results/<hash>/` instead, depending on the presence of a path suffix.