Open tibor-mach opened 2 years ago
Also, I believe I have to explicitly add the path to the params file under the `params` keyword when I am running the stage from a different working directory... Not sure if that is a bug or a feature :-)
The `params` path will be interpreted as relative to `wdir`, the same way as `deps` and `outs`. So if you don't specify a `params` path, it will look for `params.yaml` in `wdir`.
yeah, that makes sense. thanks.
Is it then possible to specify the params path at the level of the entire pipeline? I could then write a simple loop in a shell script to go through all the different params files and call dvc repro "on" each one of them. It would still be nicer to do that explicitly in the `dvc.yaml`, but this would also work.
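The shell loop mentioned here might look something like the sketch below. The directory layout (`models/model_a`, `models/model_b`, each holding its own `dvc.yaml` copy and `params.yaml`) is an assumption, not taken from the thread; the `echo` in front of the dvc call keeps the sketch safe to run as-is, and removing it would actually reproduce each pipeline:

```shell
# Hypothetical per-model loop: each listed directory is assumed to
# contain its own copy of dvc.yaml plus a params.yaml. We echo the
# command instead of executing it so the loop can be inspected safely.
for dir in models/model_a models/model_b; do
    echo "(cd $dir && dvc repro -P)"
done
```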
No, not at the moment. Doing this via templating the way you have it set up now is probably still the best way to accomplish it.
Is that something I could help with, perhaps (in case this is a feature you'd like to include)? I am not very familiar with the inner workings of dvc at this level of detail, but this (a configurable default params file) does not sound particularly complicated, and it would definitely help me a lot, so I'd love to help implement it.
Actually, I still don't quite get how the `params` path works. The thing is that I am referencing the path to the params file in the `dvc.yaml` like this:
```yaml
train_test_split:
  wdir: ../../../..
  cmd: >-
    python modules/estimation/train_test_split.py
    --params=${paths.params_file}
  deps: ...
  outs: ...
  params:
    - ${paths.params_file}: # needs to be set due to a different working directory
        - paths.data_all
        - train_test_split
```
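For context, a params file matching the references in that stage might look roughly like the fragment below. The concrete values are illustrative assumptions, not taken from the thread; only the key names (`paths.params_file`, `paths.data_all`, `train_test_split`) come from the stage definition above:

```yaml
# Hypothetical params.yaml; values are placeholders.
paths:
  params_file: modules/estimation/params.yaml
  data_all: data/all.parquet
train_test_split:
  test_size: 0.2
  random_state: 42
```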
I understand that first, dvc looks for the `params.yaml` file in the working directory. But how does it actually find it otherwise? I set the path via templating in `params` like this:

```yaml
params:
  - ${paths.params_file}: # needs to be set due to a different working directory
```

but the `paths` keyword is already a part of the `params.yaml` file, so the reference is kind of circular... dvc somehow has to know where to look for the `params.yaml` already, otherwise it could not resolve the templating reference. But if it does, why is it necessary to mention the path explicitly? I mean, it works, but it seems a bit strange to me.
@tibor-mach This answer might help you understand the differences between params and templating resolving: https://github.com/iterative/dvc/issues/7316#issuecomment-1027703686
I think configuring a default params file could be a good simple feature to add.
The default path is defined here:
And (I hope) that it is the single source of truth.
If you would like to make it configurable, you would need to first add a new config option (https://github.com/iterative/dvc/blob/af649af46276b662b4fa03fd6ab63c36521f28aa/dvc/config.py).
The way I would do it is by updating the `DEFAULT_PARAMS_FILE` behavior, probably converting it to a `@property` that checks if the config option is set, and otherwise returns `params.yaml`.
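A minimal sketch of that `@property` idea is below. The `ParamsConfig` class and the `"params_file"` option name are hypothetical stand-ins for dvc's actual config machinery; only the fallback-to-`params.yaml` behavior reflects the suggestion above:

```python
# Sketch: make the default params file configurable, falling back to
# "params.yaml" when no config option is set. ParamsConfig and the
# "params_file" key are illustrative, not dvc's real internals.
DEFAULT_PARAMS_FILE = "params.yaml"


class ParamsConfig:
    def __init__(self, config=None):
        # config mimics a parsed dvc config section,
        # e.g. {"params_file": "conf/params.yaml"}
        self.config = config or {}

    @property
    def default_params_file(self):
        # Configured path wins; otherwise use the hard-coded default.
        return self.config.get("params_file") or DEFAULT_PARAMS_FILE


print(ParamsConfig().default_params_file)  # params.yaml
print(ParamsConfig({"params_file": "conf/p.yaml"}).default_params_file)  # conf/p.yaml
```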
I see. That setup is a bit counterintuitive to me, but I guess I understand the behaviour better now :-)
Cool, seems simple enough. I'll have a look at it, thanks!
@daavoo Just one more thing... How is this going to work with `dvc.lock`? Am I going to get a single huge `dvc.lock` file (say, for 10 pipeline runs, each with a different `params.yaml`), or will I get one `dvc.lock` for each pipeline (that would be desired, at least by me)?
I hacked something like this together using a hydra `conf/` folder to first parse a set of experiment parameters using their syntax (like the dvc-supported hydra mode), but the `dvc exp` command is exceptionally slow when trying to run a large number of experiments. Instead of tracking outputs from individual experiments, I just output them to a given folder and tracked the whole folder. It's a quick hack that doesn't necessarily have a one-to-one correspondence between inputs and outputs, but I've found it really useful for doing large hyper-parameter searches across multi-stage pipelines, especially with long-running models and search spaces that benefit from caching. I suppose it wouldn't be hard to implement something like this: instead of explicit file-space configuration like `foreach` or `matrix` stages, just use the md5 hash of a set of parameters.
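The hash-based naming suggested here could be sketched as follows. The function name and the `results/` prefix are illustrative assumptions; the point is that a canonical serialization of the params dictionary yields a stable md5 digest that can name the output path:

```python
# Sketch: derive a stable output path from a parameter dictionary by
# hashing its canonical JSON form. Keys are sorted so the same params
# always map to the same path. Illustrative only; not a dvc feature.
import hashlib
import json


def params_to_path(params: dict, prefix: str = "results", suffix: str = ".json") -> str:
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}/{digest}{suffix}"


p1 = params_to_path({"model": {"layers": 4, "epochs": 20}, "data": {"random_state": 0}})
p2 = params_to_path({"data": {"random_state": 0}, "model": {"epochs": 20, "layers": 4}})
assert p1 == p2  # key order does not change the hash
print(p1)
```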
Let's assume you need to do some multi-objective search using hydra, where you might have a set number of trials and a parameter space, but the search space is large enough that naming each dataset, model, metric, and plot manually becomes burdensome. Say, 1000 trials across 5 hyper-parameters that could be categorical, ints, floats, ranges, distributions, etc.
It shouldn't be too difficult to add syntax to support this kind of reproducible search for example using:
```yaml
search:
  cmd: python example/script.py --multirun ${hydra.sweeper.params}  # normal optuna hydra params
  outs:
    - ${hydra.sweeper.storage}:  # normal hydra parameter
        persist: true  # if you want to, for example, change the set of random states without deleting older results
    - ${hydra.sweep.dir}  # normal hydra parameter
    - ${data.item}.json  # finds the md5 hash of the data param dictionary
    - ${model.item}.json  # finds the md5 hash of the model param dictionary
  metrics:
    - results/${item}.json  # finds the md5 hash of all the parameters
  params:  # this would have to point to a hydra conf folder instead of the normal params.yaml
    - conf/default.yaml:  # uses the normal hydra configuration folder
        - data
        - model
```
In this way, you can define an arbitrary search with tracked inputs and outputs, but not have to name the outputs explicitly. The one downside to this particular syntax is that you'd need to create a database (e.g. the `hydra.sweeper.storage`) rather than exploiting the database format that optuna (as in the link above) generates, meaning you'd need a new database for each output rather than a table, which is what hydra offers via the `${hydra.sweeper.study_name}` syntax in the hydra-optuna plugin. You could, however, run a separate optuna-dashboard server on that database and then query a URL, but dvc's support for this kind of parallel search is limited to grid search and explicit file configurations.
The flexibility of the full hydra launcher syntax (supporting distributed queues, multi-objective search, fine-grained joblib configuration, etc.) is far preferable to the limitations of `dvc exp` in general, so I think extending the foreach/matrix idea to implicit file-space configuration would be worthwhile.
In this way, you could test a set of model configurations across several reproducible sets of samples without having giant paths like `results/model.layers=4/model.channels=3/model.epochs=20/model.output=logits/data.random_state/data.preprocessor/data.sampler/`, but instead `results/<hash>.json` or `results/<hash>/`, depending on the presence of a path suffix.
Hi, I have a setup where I use a single pipeline (with several stages) for training multiple models which are almost the same, but use different training data and parameters.
I currently have a copy of a dvc.yaml pipeline in a folder with the respective `params.yaml` file used for each model. It looks more or less like this. This works (I then always run `dvc repro -P`), but I have to copy the pipeline file, which makes versioning difficult. The only part that is not templated (since it cannot be, AFAIK) is the default params file. I would love to have a dvc.yaml file in the root folder of my project which can be run with several different params.yaml files from several locations. Kind of like `foreach ... do`, but at the level of the entire pipeline. Also, I believe I have to explicitly add the path to the params file under the `params` keyword when I am running the stage from a different working directory... Not sure if that is a bug or a feature :-) Thanks a lot!
P.S.: I tried a similar setup with templating all the stages, but there are limitations in the way templating and foreach work right now, and I also feel like this would be a more elegant way to do it. The pipelines and the overall architecture are the same; what differs are the training data and (some) parameters, so having an option like "for each params file in the list, reproduce a separate instance of the pipeline" would make a lot of sense to me (it would then also make sense to have separate lock files).