Closed by dmpetrov 3 years ago
In my case I stored these global variables (vocabulary file, cleaned-data file, label-name file) in a configuration file as defaults, and the processing CLI can accept all of these variables to override the defaults. But my configuration-file parsing and merging is coded inside the program and had to be re-coded in every program.
Before I knew about multiple-stage pipelines I used the params functionality and hacked together a script to build a richer `dvc run` command. I used the same dot-notation you can use in the params command-line args, i.e. `dvc run -p params.yaml:data.output,train ...`. In the end my `params.yaml` looks like:
data:
  url: s3://mydata.npz
  out_path: data/mydata.npz
data/-o:              # Each element in this list will be
                      # added as a -o <arg>.
                      # To access strings within the current stage
                      # you can also use '.out_path'
  - data.out_path
data/script:
  - python src/cli/data_download.py
  - --data-size
  - .url        # equivalent to data.url
  - --out-path
  - .out_path   # equivalent to data.out_path
train:
  lr: 0.001
  ckpts: artefacts/ckpts
train/-d:
  - data.out_path
train/-o:
  - artefacts/tflogs
train/script:
  - python train.py
  - --data
  - data.out_path
  - --ckpts
  - .ckpts
  - --lr
  - .lr
I'm not suggesting changes to the multi-stage pipeline. I guess all I'm saying is:
Afterthought
After sleeping on it I thought, 'I can use jinja2 for this':
stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ args.data.size }} --out-path {{ args.data.out_path }}
    wdir: ..
    outs:
      - {{ args.data.out_path }}
  train:
    cmd: python src/cli/train.py -d {{ args.data.out_path }} --params {{ args.train.params }} --log-dir {{ args.train.logdir }} --metrics-path {{ args.train.metrics }} --ckpts {{ args.train.ckpt_dir }}
    wdir: ..
    deps:
      - {{ args.data.out_path }}
      - {{ args.train.params }}
    metrics:
      - {{ args.train.metrics }}
    outs:
      - {{ args.train.logdir }}
      - {{ args.train.ckpt_dir }}
I then load my `args.yaml` file and convert it into a namedtuple (so I can use the dot-notation). Bingo bango, we have our filled-in .dvc file.
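A minimal sketch of that load-and-render step, assuming PyYAML and Jinja2 are available. The helper names `to_namespace` and `render_pipeline` are hypothetical, not part of DVC or the author's script:

```python
import yaml
from types import SimpleNamespace
from jinja2 import Template

def to_namespace(obj):
    # Recursively convert dicts to SimpleNamespace so templates
    # can use dot-notation such as {{ args.data.out_path }}.
    if isinstance(obj, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in obj.items()})
    return obj

def render_pipeline(template_text, args_yaml_text):
    # Parse the args file, then fill in the Jinja2 template with it.
    args = to_namespace(yaml.safe_load(args_yaml_text))
    return Template(template_text).render(args=args)
```

Here `args_yaml_text` would be the contents of `args.yaml` and `template_text` the templated pipeline file; writing the rendered result to disk gives the filled-in file.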
Unfortunately, YAML anchors can only solve part of this: you can use them to fill in repeating dependencies like this:
stages:
  process:
    outs:
      - path: &cleansed cleansed.csv
    ...
  train:
    deps:
      - path: *cleansed
    ...
However, there doesn't seem to be an easy way to also "paste" them into a full command, unless we write some custom YAML parsing.
@dsuess can a jinja2-like approach from @tall-josh potentially solve this issue?
@dmpetrov I like the idea, but I see two drawbacks:
To alleviate the first, maybe we can add a special `variables` section to the Dvcfile like this:
variables:
  data:
    size: 10
    out_path: ./data/
  ...
stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ args.data.size }} --out-path {{ args.data.out_path }}
    wdir: ..
    outs:
      - {{ args.data.out_path }}
  train:
    ...
and then only template the `stages` section with those variables. No templating would be allowed for that first section.
For the second one, we could either look into disallowing certain functionality (like conditionals or loops) in jinja2. Alternatively, we need to implement our own simple templating engine that only supports a very limited set of features. So far, I only see variable substitution as necessary.
Alternatively, we could implement our own PyYAML tag to support pasting like this.
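For illustration, a custom PyYAML constructor along those lines might look like this. The `!join` tag name is made up, not an existing DVC or PyYAML feature:

```python
import yaml

def join_constructor(loader, node):
    # Concatenate the sequence items (including aliased anchors) into one string.
    return "".join(str(p) for p in loader.construct_sequence(node, deep=True))

yaml.SafeLoader.add_constructor("!join", join_constructor)

doc = """
vars:
  cleansed: &cleansed cleansed.csv
stages:
  train:
    deps:
      - *cleansed
    cmd: !join ["python train.py --data ", *cleansed]
"""
data = yaml.safe_load(doc)
# data["stages"]["train"]["cmd"] == "python train.py --data cleansed.csv"
```

This lets an anchored scalar be "pasted" into the middle of a command string, which plain aliases cannot do.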
@dsuess, good idea, but there is a question: `variables` need to be shared among DVCfiles; many of them (for me: embedding size, vocabulary file, sequence length, etc.) must stay the same in all stages. Maybe we should make the DVCfile inherit parameters from its ancestors.
There are definitely use cases for both in-file variables and external variable files. I agree with @dsuess that full templating is probably not the way to go as a proper solution.
Maybe it's just a terminology thing but @karajan1001, if you're thinking full-blown inheritance then I think things could get messy. A simple way of importing from one or more files could be nice:
vars:
  IMPORT:
    somevars: some/vars.yaml
    morevars: more/vars.yaml
  data:
    size: 10
    out_path: ./data/
  ...
stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ vars.data.size }} --out-path {{ vars.data.out_path }}
    wdir: ..
    outs:
      - {{ vars.data.out_path }}
  train:
    ...
    outs:
      - {{ somevars.something.x }}
      - {{ morevars.someotherthing.y }}
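A sketch of how such an IMPORT section could be resolved: each entry is loaded from its own YAML file and exposed under its key. The `load_imports` helper is hypothetical, and the file contents are inlined via a lookup callable for illustration:

```python
import yaml

def load_imports(import_map, read_file):
    # read_file is any callable returning the YAML text for a given path,
    # so each imported file becomes its own named scope (somevars, morevars, ...).
    return {name: yaml.safe_load(read_file(path))
            for name, path in import_map.items()}

files = {
    "some/vars.yaml": "something:\n  x: 1",
    "more/vars.yaml": "someotherthing:\n  y: 2",
}
scopes = load_imports(
    {"somevars": "some/vars.yaml", "morevars": "more/vars.yaml"},
    files.__getitem__,
)
# scopes["somevars"]["something"]["x"] == 1
```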
@dsuess the variables section makes total sense. I see that as a simple form of variables with no param file associated. We definitely need it and this part has a higher priority.
It looks like all of us are on the same page regarding Jinja - full support is too much.
Also, I like @tall-josh's idea of importing which unifies the explicit vars definition:
vars:
  IMPORT:
    somevars: some/vars.yaml
    morevars: more/vars.yaml
It looks like the high-level design is ready :) and we need to agree on the syntax and then a templating engine.
On the syntax side, I see these options:
1. `"$vars.somevar"`
2. `{{ vars.somevar }}`
3. `!var [*vars.somevar]`
What are your thoughts on this?
(3) seems like a more formalized way of doing (1).
PS: `dvc run` can use a slightly different syntax, or we can decide not to support variables at the command level - let's keep that discussion separate.
Hi @dmpetrov Yep, we're on the same page. I like the first syntax simply because it's concise.
@karajan1001 Agreed. I think this change makes most sense with #3584.
@dmpetrov If we want to share variables between files, I think @tall-josh's solution is better than implicitly passing variables through from the dependencies. That would also cover the use case of having a single variables file that's shared throughout the repo. However, I also think that not passing them between files wouldn't be a problem if the multi-stage files from #3584 are the new standard. This way, files would scope variables nicely.
Regarding the syntax: I don't have any preference as long as the files are still valid YAML. The first one will interfere with standard shell variable substitution in the command, which might be a good thing as using non-tracked variables in the command will break reproducibility. It might also be a bad thing if there's a valid use case for having shell-variables in the command.
@dsuess I'm not sure multi-stage is quite at the point of being the new standard, though I think it will get there soon. Someone correct me if I'm wrong, but as of a few days ago multi-stage (#3584) did not support `--params` :-(. Something I've been meaning to bring up in that thread. I'll do that now, I guess.
@dsuess agree, defining variables from command line is not the high priority feature. We should have a great solution in the pipeline file. Also, a good point about env variables.
@tall-josh it would be great to hear your params ideas! You can post it on the multi-stage issue or just here.
@dmpetrov. No probs, I've raised the issue (#3584), but it's not really an idea. It's more along the lines of keeping multi-stage consistent with the regular single-stage .dvc files, i.e.:
dvc run --params myparams.yaml:train "cat myparams.yaml"
Produces a dvc file:
md5: d978fc54ad4ee1ad65e241f0bd122482
cmd: cat myparams.yaml
deps:
  - path: myparams.yaml
    params:
      train:
        epochs: 10
        lr: 0.001
Whereas using multi-stage:
dvc run -n stage1 --params myparams.yaml:train "cat myparams.yaml"
results in a DvcFile that looks like this, plus a lock file, neither of which refers to the `train` key in `myparams.yaml`:
stages:
  stage1:
    cmd: cat myparams.yaml
    deps:
      - myparams.yaml
I've directed @skshetry to this thread too, as I think the two issues would benefit from one another.
Hi @tall-josh. We now have params support for the pipeline file (as we are calling it at the moment). We are planning to release it soon (with a public beta in this week).
You have already seen the file format, so, I'll just enter to the discussions of variables. I'd propose not having an indirection of reading variables from the params file, but from the pipeline file itself. The pipeline file is very readable and clean, and we'll strive to make it so in the future.
And, we can introduce a `variables` section in the pipeline file, which can be used for reading params by DVC. This way, there can be a clear separation between params required by your scripts vs. those required by DVC, so as not to duplicate information (but, again, for read-only purposes you can just add them to the pipeline file, no problem with that).
So, it should look something like:
variables:
  data:
    size: 10
    out_path: ./data/
stages:
  data:
    cmd: python src/cli/data_download.py --data-size {{ data.size }} --out-path {{ data.out_path }}
    wdir: ..
    outs:
      - {{ data.out_path }}
Note that I'm not focusing on templating formats (as it's likely going to be tied to the implementation), but we should try to make it logic-less. Providing logic will only make it complicated and ugly.
cc @karajan1001 @dsuess
Seems that most of you agree that reusable yaml-variables should be defined directly in the pipeline file (I do too). So maybe you should rename the issue @dmpetrov ?
@skshetry Amazing news! I can't wait to give it a crack!
@skshetry I have finally got some overhead to give the params support for the pipeline file. Do you have an example somewhere?
EDIT: Ahhh, I misunderstood your comment. If I now understand correctly: there is now `--params` support for multi-stage dvc files, but you're still working on templating variables.
@tall-josh yes, the templating is still in progress. Right now we are busy releasing DVC 1.0 https://dvc.org/blog/dvc-3-years-and-1-0-release The parametrization is a good feature for the next step - DVC 1.1, I think.
@elgehelge I think there is no need to rename this issue. It was created specifically to find a solution for using `params.yaml` in the pipeline. It is a useful approach when you need to localize all the configuration in a single file - a very common pattern. Also, it might be useful when some external tools can properly read and modify simple parameter files like `params.yaml` but do not understand the more complicated format of `dvc.yaml`. And we already got a lot of good ideas on how to implement it.
But I agree that the "static" params are a super useful tool and should be the first priority (I had no doubt about that). I can create a separate issue for the "static" parameters case.
I like how simple variables can be implemented:
variables:
  data:
    size: 10
    out_path: ./data/
Why don't we use the same approach for importing all the params:
params_variables: [params.yaml]
stages:
  mytrain:
    cmd: python train.py --batch-size {{ train.batch_size }} --out-path {{ train.out_path }}
    outs:
      - {{ train.out_path }}
As an option, we can import by param name or param section name:
params_variables:
  - params.yaml: [process.threshold, train]
In this case, `dvc.lock` should include all the params that were used in this stage, like:
params:
  train.batch_size: 2048
  train.out_path: "out/model.p"
What do you think folks?
@dmpetrov Thanks for the clarification. In the meantime, I had a crack at implementing a `vars` section at the top of the multi-stage `dvc.yaml` file using Jinja2. It's a bit janky, but it appears to have the desired effect.
https://github.com/tall-josh/dvc/tree/master/TEMPLATING_DEMO
The `dvc.yaml` file looks like:
vars:
  data: data.txt
  base_dir: base_dir
  stage1_out: stage1_output.txt
  stage2_out: stage2_output.txt
  splits:
    train: 0.89
    eval: 0.3
stages:
  st1:
    deps:
      - {{ vars.data }}
    outs:
      - {{ vars.base_dir }}/{{ vars.stage1_out }}
    cmd: >-
      mkdir {{ vars.base_dir }};
      cat {{ vars.data }} > {{ vars.base_dir }}/{{ vars.stage1_out }}
  st2:
    deps:
      - {{ vars.base_dir }}/{{ vars.stage1_out }}
    outs:
      - {{ vars.base_dir }}/{{ vars.stage2_out }}
    cmd: >-
      echo "train: {{ vars.splits['train'] }}" > {{ vars.base_dir }}/{{ vars.stage2_out }};
      cat {{ vars.base_dir }}/{{ vars.stage1_out }} >>
      {{ vars.base_dir }}/{{ vars.stage2_out }}
Hello everyone! I'm a bit late to the party, but here are some ideas about using variables for looping (original discussion started in https://discuss.dvc.org/t/using-dvc-to-keep-track-of-multiple-model-variants/471).
In a nutshell, I want to use the same pipeline to train models using different groups or sets of data. The steps are the same, but the data is slightly different and the outputs have different paths.
Having `vars` or a `params.yaml` file would be great! I like the idea of declaring how the DAG should look and what parameters it needs to run.
From the original post:
params:
  group1:
    name: "group 1"
    something_else: "foo"
  group2:
    name: "group 2"
    something_else: "meh"
  group3:
    name: "group 3"
    something_else: "gwah"
stages:
  get_some_data:
    cmd: ./get_data.sh {{params.name}}
    deps:
      - get_data.sh
      - query.sql
    outs:
      - data.csv
  train_model:
    cmd: Rscript train.R {{params.name}} {{params.something_else}}
    deps:
      - data.csv
    outs:
      - models/{{params.name}}/model.xgb
    metrics:
      - metrics/{{params.name}}/model_eval.json
No ideas at the moment regarding how to declare that some parameters should be looped on.
I imagine `dvc repro` would then run the whole DAG group-by-group from start to finish (as opposed to stage-by-stage).
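One way to picture the expansion: render the stage template once per parameter group, so each group yields its own concrete stage. The `expand_stages` helper and the `-groupN` naming are illustrative only, not a proposed DVC API:

```python
def expand_stages(cmd_template, groups):
    # Produce one concrete stage per parameter group, suffixing
    # the stage name with the group key.
    stages = {}
    for key, group in groups.items():
        stages[f"train_model-{key}"] = {"cmd": cmd_template.format(**group)}
    return stages

params = {
    "group1": {"name": "group1", "something_else": "foo"},
    "group2": {"name": "group2", "something_else": "meh"},
}
stages = expand_stages("Rscript train.R {name} {something_else}", params)
# stages["train_model-group1"]["cmd"] == "Rscript train.R group1 foo"
```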
I've been mulling this over too. Would be super handy!!
This only focuses on parametrization and the possible solutions that were explored.

Jinja2
- Similar to @tall-josh's implementation on #4463.
- Supports complex logic (build-matrix and for-each loops). See: #331, #1018, #1462 and #4213.
- Uses `{{ var }}` syntax, which is error-prone in YAML as it's a JSON and requires enclosing with `" "` (quotes). But, it supports custom syntax too.
- (Renders everything to `str` recursively.)
- Supports `if` conditionals and `for` constructs. We only need a substitution for now.

Omegaconf
- Omegaconf is a YAML based hierarchical configuration system, i.e. it can load config from multiple files and merges them together. It also supports interpolation by referencing different parameters, but only within the config.
- Uses a `${}` based syntax, which I personally prefer.

ConfigParser.ExtendedInterpolation and friends
- Suited for `ini` formats. Syntax is `${section:key}`.
- Looked into Thinc's, which extends it to support dot-based syntax.

Github Actions
- I was unable to find how Github Actions parses them. We do know that it uses the `${{ var }}` syntax (is it because it clashes with the `${}` bash expand syntax? Or, is there more to it?).
- Though, we do have a tool called `act` that allows us to run Github Actions locally. It is written in Go, and it seems it uses a javascript interpreter to evaluate those expressions. Similarly, another tool (written in JS) parses those expressions into a javascript AST.

str.format, string.Template and friends
- `str.format` uses `{ var }` syntax and it requires us to "sandbox" it properly. Still, hate the syntax.
- `string.Template` supports `$var` and `${var}` syntax, but does not support accessing attributes (eg: `${x.y}`).

I didn't find any good "readymade" libraries (or, a combination thereof) that could cover all of this. So, I propose we roll our own custom interpolation with custom syntax. /cc @iterative/engineering

The proposal will soon follow.

Note: I have to point out that the `${}` interpolation is the proposal; the rest are part of the discussions that'll eventually be implemented on top of interpolation.
I propose the syntax `${ var }` for interpolation, as many users will already be familiar with it, and it does not have the problem that `{ var }` has in YAML. We have to handle the conversions (to list/dict/bool/string/int/float, et cetera) ourselves as well.
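A toy version of such a substitution-only engine with dotted lookups and no logic — a sketch of the idea, not DVC's actual implementation:

```python
import re

VAR_RE = re.compile(r"\$\{\s*([\w.]+)\s*\}")

def lookup(dotted, variables):
    # Resolve a dotted path like "prepare.script" in a nested dict.
    value = variables
    for part in dotted.split("."):
        value = value[part]
    return value

def interpolate(text, variables):
    # Replace every ${ var } occurrence; conversion to str is explicit.
    return VAR_RE.sub(lambda m: str(lookup(m.group(1), variables)), text)

cmd = interpolate("python ${prepare.script} ${lr}",
                  {"prepare": {"script": "src/train.py"}, "lr": 0.042})
# cmd == "python src/train.py 0.042"
```

Keeping it to plain substitution, with types converted explicitly, avoids the conditionals/loops baggage of a full templating engine.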
Regarding the imports of the variables, let me just show an example `dvc.yaml`:
args:
  imports:
    - params.yaml
    - params.toml
    - params.json:Train
    # all of these are merged together (note: only dicts are merged)
  constants:
    # these are also merged together
    prepare:
      script: src/train.py
stages:
  train:
    cmd: python ${prepare.script} ${lr} ${layers}
    deps:
      - data/data.xml
      - ${prepare.script}
    params:
      - lr
      - layers
    outs:
      - data/prepared
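The "merged together (only dicts are merged)" note above could behave like the recursive merge below. This is a sketch; the tie-breaking rule of later files winning on scalar conflicts is my assumption:

```python
def merge(dst, src):
    # Recursively merge src into dst: nested dicts are combined,
    # while scalars and lists are overwritten by the later file.
    for key, value in src.items():
        if isinstance(value, dict) and isinstance(dst.get(key), dict):
            merge(dst[key], value)
        else:
            dst[key] = value
    return dst

combined = {}
for cfg in [{"lr": 0.042, "train": {"epochs": 10}},
            {"train": {"batch_size": 64}}]:
    merge(combined, cfg)
# combined == {"lr": 0.042, "train": {"epochs": 10, "batch_size": 64}}
```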
Let's say `params.yaml` has the following values:
lr: 0.042
layers: 8
classes: 4
You can see that there's a duplication between the `params` dependency and the arguments in the `cmd` section. So, as per #4525, it'd be nice to remove this duplication and make dvc understand them automatically.
This is still under discussion, but @dmpetrov suggested using the `${{ var }}` syntax, through which DVC should auto-track them (there will be entries in dvc.lock; they just won't be in dvc.yaml's `params` section). I quite like this syntax, although I have a few alternative syntaxes:
1. `${}` for expansion, and `$()` for auto-track and expansion.
2. `${ }` for expansion, and `${track: prepare.seed}` for auto-track and expansion.
3. `${}` for expansion, and `${ track(prepare.seed) }` for auto-track and expansion (similar to Actions).
Please do suggest if you have a good syntax. Also note that, during discussion w/ @dmpetrov, we decided not to solve #4525, which proposes to auto-track but not expand variables, and instead do both of those together.
So, this is something that will be useful for `foreach` and/or `build-matrix`, which will need to interpolate values from imported params (if not already useful for params substitution).
This is very early in the discussion, but I'll post what @dmpetrov suggested:
args:
  ...
stages:
  foreach: ${ outputs }
  prepare:
    cmd: python prepare.py ${ item }
    deps:
      - raw/${ item }
    outs:
      - processed/${ item }
This will support iterating over dicts, and `item` or something similar might be a "reserved" keyword inside looping; for the stage name, we could append the key to its declared name (i.e. `prepare-file1.txt` and `prepare-file2.txt`).
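Sketched in code, that expansion might behave as follows. The `expand_foreach` helper is hypothetical; it substitutes `${ item }` everywhere and suffixes the declared stage name with the item:

```python
import re

ITEM_RE = re.compile(r"\$\{\s*item\s*\}")

def expand_foreach(name, stage, items):
    # One concrete stage per item: substitute ${ item } in cmd/deps/outs
    # and append the item to the declared stage name.
    expanded = {}
    for item in items:
        sub = lambda s: ITEM_RE.sub(str(item), s)
        expanded[f"{name}-{item}"] = {
            "cmd": sub(stage["cmd"]),
            "deps": [sub(d) for d in stage["deps"]],
            "outs": [sub(o) for o in stage["outs"]],
        }
    return expanded

stages = expand_foreach(
    "prepare",
    {"cmd": "python prepare.py ${ item }",
     "deps": ["raw/${ item }"],
     "outs": ["processed/${ item }"]},
    ["file1.txt", "file2.txt"],
)
# stages["prepare-file1.txt"]["cmd"] == "python prepare.py file1.txt"
```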
You might be able to provide values directly as well. Something that I have in mind:
foreach:
  values:
    x: 1
    y: 2
stage-name:
  ...
Though, if I look at `foreach` through this use case, I think it should be just inside of the stage rather than one indentation outside, as in the previous example.
There are a few open questions:

- Do we support interpolation inside interpolation? Eg: `cmd: python script.py ${i} ${{ train${i} }}`
- Do we require a local "variable" of some sort, accessible inside each iteration of the loop? It could be something that the user could set:

  set:
    data: ${prepare.data} # would be accessible as ${ data }
  ...

- How to reference the top-level of "args", the whole merged value to loop through in `foreach`? Do we need a special name/keyword for it?
- Do we need to support custom `include`/`exclude` of the `foreach` values?
- Also, we might need to support some range-like cases, roughly following (as suggested by @dmpetrov):

  foreach "i = range(0, 5)":
    prepare:
      cmd: python script.py
      params:
        - train${i}
Some of the inspirations will for sure be taken from Ansible and pypyr.
Again, we just discussed this, so it's very early - take this comment on `foreach` with a grain of salt. I just wanted to share what we've discussed so far.
Did I tell you that the config can be overridden from the CLI? Eg:
$ dvc repro --params prepare.seed=3
@skshetry thank you for the deep investigation and all the insights you found 💎
It feels like we are getting closer to the final design. I just want to share a few comments and my personal preferences.
args:
  imports:
    - params.yaml
  constants:

`args` is a bit too specific - the usage of the variables is wider than only args in `cmd`. I'd prefer `vars` or `params` over `args`. I'd prefer separate params (params-file-level abstraction) and vars (pipeline-level abstraction). That way we can think about using `vars` for dvc.yaml in all the places, like `dvc repro --vars prepare.seed=3`.
1. `${}` for expansion, and `$()` for auto-track and expansion.
2. `${ }` for expansion, and `${track: prepare.seed}` for auto-track and expansion.
3. `${}` for expansion, and `${ track(prepare.seed) }` for auto-track and expansion (similar to Actions).

The first option might look confusing for Unix users since both `${}` & `$()` have the same meaning in Unix. All the other options look good to me. My personal preference: `${}` + `${{}}`, or `${}` + `${param: prepare.seed}` (`track:` is ok).
foreach: ${ outputs }
prepare:
  cmd: python prepare.py ${ item }

We might need some option to overwrite `item`, like:

foreach: ${ outputs }
item: input_file
prepare:
  cmd: python prepare.py ${input_file}
- Do we require local "variable" of some sort, accessible inside each iteration of the loop? It could be something that user could set:

  set:
    data: ${prepare.data} # would be accessible as ${ data }

It is syntactic sugar. Not a priority at all.
- How to reference the top-level of "args", the whole merged value to loop through in `foreach`? Do we need a special name/keyword for it?

I'm not sure I understand this.
- Do we need to support custom `include`/`exclude` of the `foreach` values?

It seems a bit too advanced. We should expect a very limited list of values, like a build matrix in CI systems.
- Also, we might need to support some range-like cases, roughly following (as suggested by @dmpetrov):

  foreach "i = range(0, 5)":
    prepare:
      cmd: python script.py
      params:
        - train${i}

It might look better as a separate option:

foreach:
  range:
    var: i
    from: 0
    to: 5
prepare:
  cmd: python script.py
  params:
    - train${i}
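Either spelling would effectively expand to one param entry per index. A minimal sketch of that expansion (the `expand_range_params` helper is hypothetical):

```python
def expand_range_params(template, var, start, stop):
    # Substitute ${i} (or whatever `var` is named) for each value in the range.
    token = "${" + var + "}"
    return [template.replace(token, str(i)) for i in range(start, stop)]

params = expand_range_params("train${i}", "i", 0, 5)
# params == ["train0", "train1", "train2", "train3", "train4"]
```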
A couple more open questions:
Looks like I have some catching up to do!
Looks like I have some catching up to do!
:slightly_smiling_face: @tall-josh, I was reviewing your pull request. Thank you so much for helping move this forward.
@skshetry awesome research, a few questions:
1. Why do we need `imports`? At least initially. Keep the same name-resolving mechanism as we use for parameters. We won't need to merge stuff, and we can simplify the constants section significantly.
2. Can we avoid creating a separate syntax to track something? E.g. by treating external files as params, and the constants section as ... well, constants that serve templating purposes. We kinda force users to define tracked and non-tracked things in different places. But to be honest, I think that duplicating the tracked section explicitly is better than introducing "magic" and some special syntax - that will make the implementation harder and the learning curve steeper.
3. Loops - I need to wrap my mind around them. First, what task are we solving with this? It looks like we solved the case where we generate multiple stages. Does it solve the case where we need to iterate over multiple items within a single stage? Should we prioritize this and keep pipeline construction somewhere outside of dvc.yaml?
1. Why do we need `imports`? At least initially. Keep the same name-resolving mechanism as we use for parameters. We won't need to merge stuff, we can simplify the constants section significantly.
It seems that there are two (there might be more) needs for parametrization. One is simple loop/build-matrix scenarios that don't have anything to do with parameters; the other is building different datasets/models based on different parameters in `params.yaml`. So, these are not separate things; one or the other could be used for parametrization/loops.
One example that @dmpetrov provided was building models with different sets of params kept as:
us:
  filename: something
ch:
  filename: something-something
And, using these to build 2 different stages/models with different sets of parameters (that are inside of their respective keys).
2. Can we avoid creating a separate syntax to track something?

Yes, I'd love to do that, but as I said above, they seem to be kind of interconnected. If we create a hard line between params and constants, then this might work.
First, what task are we solving with loops?
I see it mostly as a generic build-matrix. But, it looks like we need different syntax/semantics based on how people try to use it and keep their parameters on params.yaml. (simple loops, nested loops/build-matrix, iterating just by the keys, etc.).
It looks like we solved the case where we generate multiple stages. Does it solve the case when we need to iterate on multiple items within a single stage?
I did not quite get these questions/statements.
Should we prioritize this and keep pipeline-construction somewhere outside of dvc.yaml?
"Outside of the dvc.yaml" - could you please elaborate on this?
The biggest improvement of the parametrization feature I've experienced is that I can now scale the workflow by only modifying `params.yaml`:
This defines the minimal repro workflow (for simplicity I've removed a lot of params here); I use this for my GitLab CI setting:
# params.yaml
models:
  gaussian_15:
    name: gaussian_15
    data:
      noise_level: 15
    train:
      epochs: 2
datasets:
  train:
    name: train
    src: data/DnCNN.zip
    path: data/train/rawdata
  Set12:
    name: Set12
    src: data/DnCNN.zip
    path: data/test/Set12
testcases:
  set12_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set12
In this setting, I trained one model and tested it on one dataset (Set12). The corresponding DAG is:
I could easily repeat some of the contents in `params.yaml`. In my case, I now trained three models and tested each model on two datasets (Set12 and Set68):
# params.yaml
models:
  gaussian_15:
    name: gaussian_15
    data:
      noise_level: 15
    train:
      epochs: 50
  gaussian_25:
    name: gaussian_25
    data:
      noise_level: 25
    train:
      epochs: 50
  gaussian_50:
    name: gaussian_50
    data:
      noise_level: 50
    train:
      epochs: 50
datasets:
  train:
    name: train
    src: data/DnCNN.zip
    path: data/train/rawdata
  Set12:
    name: Set12
    src: data/DnCNN.zip
    path: data/test/Set12
  Set68:
    name: Set68
    src: data/DnCNN.zip
    path: data/test/Set68
testcases:
  set12_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set12
  set12_gaussian_25:
    model_name: gaussian_25
    noise_level: 25
    dataset: Set12
  set12_gaussian_50:
    model_name: gaussian_50
    noise_level: 50
    dataset: Set12
  set68_gaussian_15:
    model_name: gaussian_15
    noise_level: 15
    dataset: Set68
  set68_gaussian_25:
    model_name: gaussian_25
    noise_level: 25
    dataset: Set68
  set68_gaussian_50:
    model_name: gaussian_50
    noise_level: 50
    dataset: Set68
:tada: The workflow gets scaled automatically, without touching any of the other files. It's still unclear to me what the best practice is for parametrizing the workflow, but I'm pretty satisfied with how it keeps the code clean. I'm still playing with dvc and the parametrization; when I feel ready, I might summarize it all in a blog post.
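Since the `testcases` section above is just the cross product of models and test datasets, it could even be generated instead of written by hand. A sketch of that idea, not a DVC feature:

```python
import itertools

models = {"gaussian_15": 15, "gaussian_25": 25, "gaussian_50": 50}
test_sets = ["Set12", "Set68"]

# Build one testcase per (dataset, model) pair, mirroring the keys above.
testcases = {
    f"{ds.lower()}_{model}": {
        "model_name": model,
        "noise_level": noise,
        "dataset": ds,
    }
    for ds, (model, noise) in itertools.product(test_sets, models.items())
}
# len(testcases) == 6, matching the hand-written section
```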
The `params.yaml` can be further simplified if dvc borrows Travis' build-matrix semantics, e.g.:
stages:
  build:
    foreach-matrix:
      - ${model}
      - ${dataset}
    do:
      ...
or Ansible's filter semantics, e.g.:
stages:
  build:
    # item[0] := ${model}
    # item[1] := ${dataset}
    foreach: ${model} | zip(${dataset})
    do:
      ...
I don't have a preference here, I like both specifications.
For normal data-science projects, it might be better to adopt the Travis way because data scientists can adapt to it more easily; they normally are not good programmers :fearful: I'm not sure of it, though.
As the workflow can now scale up very easily, it becomes a pain to queue all jobs into one sequential `dvc repro`. I still don't have a good sense of how this can be done programmatically, but having a loose read lock (#4979) sounds like a good start.
~I haven't followed all the discussions in #755 yet~; it can be quite challenging to let DVC figure out the best parallelization strategy. So my intuition is that, with the current `dvc repro -s` functionality, it might be easier to put this into another YAML file:

Edit: This seems to be option 1 that @dmpetrov mentioned in https://github.com/iterative/dvc/issues/755#issuecomment-561031849
# jobs.yaml
jobs:
  - name: extract_data@*
    limit: 1  # 1 means no concurrency; this can be set as the default value
  - name: prepare@*
    limit: 4  # at most 4 concurrent jobs
    env:
      JULIA_NUM_THREADS: 8
  - name: train@*
    limit: 4
    env:
      # if it's a scalar, apply to all jobs
      JULIA_NUM_THREADS: 8
      # if it's a list, iterate one by one
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
        - 3
  - name: evaluate@*
    limit: 3
    env:
      JULIA_NUM_THREADS: 8
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
  - name: summary

`extract_data@*` follows the same glob syntax as in #4976.
P.S. I don't have access to slurm cluster so my ideas on parallelization are basically on a local machine. It's not very clear to me whether distributed computation should be handled by dvc or by concrete languages/toolboxes that dvc calls.
Hi, everyone. The parametrization feature is complete, at least the things that we planned for the next release. We are hoping to release a pre-release version this week. So, it'd be really helpful if you guys could provide feedback.
Are there docs or examples for this? Does it follow @johnnychen94 's latest comment? If there are no docs but I can get a nudge in the right direction I could put together a Notebook or something.
Ahhh think I found what I'm looking for in tests/func/test_run_multistage.py:test_run_params_default()
Are there docs or examples for this?
@tall-josh, it's in the wiki as mentioned here: https://github.com/iterative/dvc/wiki/Parametrization
@johnnychen94's example also works.
Ahhh think I found what I'm looking for in tests/func/test_run_multistage.py:test_run_params_default()
What are you looking for exactly? I don't think it has anything to do with parameterization.
Thanks @skshetry, that wiki is perfect. Yeah, `test_run_params_default()` did not give me what I wanted.
With the introduction of the new multiple-stage pipeline, we will need to find a way of defining variables in the pipeline. For example, the intermediate file name `cleansed.csv` is used by two stages in the following pipeline and needs to be defined as a variable.

We need to solve two problems here: ... (like in the `train` stage), not from the command line (like in the `process` stage).

We can solve both of the problems using a single abstraction - a "parameters file variable". This feature is useful in the current DVC design as well. It is convenient to read file names from the params file and still define the dependency properly, like `dvc run -d params.yaml:input_file -o params.yaml:model.pkl`.