iterative / dvclive

📈 Log and track ML metrics, parameters, models with Git and/or DVC
https://dvc.org/doc/dvclive
Apache License 2.0

Foreach to run multiple experiments #751

Closed JenniferHem closed 5 months ago

JenniferHem commented 10 months ago

Dear dvc team,

In Discord I recently posted a question about using a foreach loop and realized that my intended way of using foreach might not be supported (yet). So I was wondering whether this scenario is something that could be supported, or whether you have tips for me on how to set up my workflow better:

We tried to build an ML pipeline that is easy to use and essentially configured automatically, so users do not have to worry about setting up DVC workflows themselves.

We start with a CSV file. This file contains our ML input plus multiple columns, each holding a different "y", i.e. a different but usually related machine learning task. To easily train all the models without needing to run dvc exp run multiple times, we use a foreach loop. Our stage looks like this:

trainBest:
    foreach: ${endpoints}  # Specified in data_config.yaml
    do:
      cmd: python ./scripts/02_train_best_model.py --endpoint ${item}
      deps:
      - ./config/data_config.yaml
      - ./modeldata/model_performances/${item}_performance.tsv
      - ./data/${data_path}    # Specified in data_config.yaml
      - ./scripts/02_train_best_model.py
      outs:
      - ./modeldata/final_models/${item}.joblib
      - ./modeldata/final_models/${item}_config.json
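For reference, the ${endpoints} and ${data_path} values above are interpolated from the project's data_config.yaml. A minimal sketch of what that file might contain (the exact entries are assumptions; only the two keys referenced in the stage are shown):

```yaml
# Hypothetical data_config.yaml: just the keys the stage interpolates.
endpoints:            # one entry per "y" column / ML task
  - endpoint_a
  - endpoint_b
data_path: input_data.csv   # placeholder file name
```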

Now I thought that this would ultimately produce n experiment rows, one for each foreach item. However, this is not the case.

My current solution is to use dvclive and track each loop iteration in a different directory, so the end of my 02_train_best_model.py script now contains the following lines (as suggested in Discord):

metrics_dir = f"./modeldata/final_models/{endpoint}"
with Live(dir=metrics_dir, save_dvc_exp=True, exp_name=f"Performance_hyperparam_{now_str}") as live:
    for col, value in best_performance.items():
        live.log_metric(col, value)
    # No explicit live.end() needed: the context manager calls it on exit.

However, since I run the script with dvc exp run, the save_dvc_exp and exp_name parameters are ignored. I was hoping that via those I would ultimately end up with the n rows (one for each y) in dvc exp show.

Running without the foreach loop is of course possible, but letting my users get all the desired models and evaluations with just "dvc exp run" seems so much simpler than once again compiling bash scripts for an external loop.

I would be happy if you could guide me or let me know if this is a use-case you would consider for the foreach stages.

dberenbaum commented 10 months ago

@JenniferHem Have you tried keeping them all "horizontally" in one line? How related is each experiment? How would you like to be grouping each "run" that combines the endpoints if you were to have a line per endpoint?

JenniferHem commented 10 months ago

@dberenbaum thanks for the answer. We work in the bio field, so the endpoints really do concern similar biology. While experimenting a bit, I actually realized that with the dvclive setup I described I do get the horizontal tracking you mentioned; I just had a display issue that hid part of the horizontal table. Considering that all endpoints are trained in the same run, it makes sense to save it like this. One thing I am still wondering about (which might be me not fully grasping the capabilities): is there a way to track the same metrics files as above, keeping them horizontal, but without needing to spread the metrics.json files across directories manually via dvclive as an additional factor? Does that make sense?

dberenbaum commented 10 months ago

It makes sense. You should be able to use Live(resume=True) to keep appending to the same dvclive folder for each model run. There was a bug, fixed in https://github.com/iterative/dvclive/pull/740, that prevented dvclive from preserving metrics when resuming, so it will not work as expected until the next release unless you install from the main branch (pip install git+https://github.com/iterative/dvclive.git).