iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

Nested matrix substitution #9948

Open kosmitive opened 11 months ago

kosmitive commented 11 months ago

Sometimes evaluating certain metrics can take some time, so we want to cache these results using dvc. For that we consider the following params.yaml file

experiments:
  experiment_a:
    metrics:
      metric_a_1:
         idx: <str>
         kwargs: <dict>
      [...]
  experiment_b:
    metrics:
      metric_b_1:
         idx: <str>
         kwargs: <dict>
      [...]

combined with the dvc.yaml file

stages:
  dispatch_experiment:
    matrix:
      experiment: ${experiments}
      metric: ${experiments.${item.experiment}.metrics}
    cmd: python -m scripts.dispatch_experiment
      --experiment-name ${item.experiment}
      --metric-name ${item.metric}
    outs:
      - ${item.experiment}/${item.metric}.csv

in order to make looping with the matrix notation more flexible by allowing nested substitution (${item.experiment} inside a matrix value).
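
For concreteness, the (experiment, metric) pairs such a nested matrix would iterate over can be sketched in plain Python. The params dict below mirrors the params.yaml above; the metric names and idx values are placeholders, not from the issue:

```python
# Placeholder params mirroring the params.yaml structure above.
params = {
    "experiments": {
        "experiment_a": {"metrics": {"metric_a_1": {"idx": "mse", "kwargs": {}}}},
        "experiment_b": {"metrics": {"metric_b_1": {"idx": "mae", "kwargs": {}}}},
    }
}

# The nested matrix would effectively produce this cross product:
# one entry per metric, scoped to its own experiment.
pairs = [
    (experiment, metric)
    for experiment, cfg in params["experiments"].items()
    for metric in cfg["metrics"]
]
print(pairs)
# [('experiment_a', 'metric_a_1'), ('experiment_b', 'metric_b_1')]
```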

dberenbaum commented 11 months ago

@kosmitive Do you have a real use case in mind that you could share? It will help understand the issue and context better, think of possible workarounds, and prioritize appropriately.

kosmitive commented 11 months ago

@dberenbaum I updated the issue description above.

> @kosmitive Do you have a real use case in mind that you could share? It will help understand the issue and context better, think of possible workarounds, and prioritize appropriately.

A workaround would be to unroll the nested matrix in dvc.yaml into

stages:
  dispatch_experiment_experiment_a:
    matrix:
      metric: ${experiments.experiment_a.metrics}
    cmd: python -m scripts.dispatch_experiment
      --experiment-name experiment_a
      --metric-name ${item.metric}
    outs:
      - experiment_a/${item.metric}.csv

  dispatch_experiment_experiment_b:
    matrix:
      metric: ${experiments.experiment_b.metrics}
    cmd: python -m scripts.dispatch_experiment
      --experiment-name experiment_b
      --metric-name ${item.metric}
    outs:
      - experiment_b/${item.metric}.csv
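
Since dvc.yaml is plain YAML, the unrolling itself could also be generated by a small script instead of written by hand. A minimal sketch, assuming a hypothetical build_stages helper that mirrors the stage layout of the workaround above:

```python
# Hypothetical generator for the unrolled dvc.yaml: one stage per
# experiment, each with its own single-level matrix over metrics.
def build_stages(experiments):
    stages = {}
    for name in experiments:
        stages[f"dispatch_experiment_{name}"] = {
            # Left as a DVC template string, resolved by dvc at run time.
            "matrix": {"metric": f"${{experiments.{name}.metrics}}"},
            "cmd": (
                "python -m scripts.dispatch_experiment "
                f"--experiment-name {name} "
                "--metric-name ${item.metric}"
            ),
            "outs": [name + "/${item.metric}.csv"],
        }
    return {"stages": stages}

doc = build_stages(["experiment_a", "experiment_b"])
print(sorted(doc["stages"]))
# ['dispatch_experiment_experiment_a', 'dispatch_experiment_experiment_b']
```

The resulting dict could then be dumped to dvc.yaml with any YAML library before invoking dvc.
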
dberenbaum commented 11 months ago

Makes sense @kosmitive, thanks! I don't fully get how you are using metrics in your example. Where do you use idx and kwargs?

kosmitive commented 11 months ago

@dberenbaum idx and kwargs are used as in

from functools import partial

def prepare_metric(experiment_name: str, metric_name: str) -> partial:
    metric_config = params["experiments"][experiment_name]["metrics"][metric_name]
    metric_idx = metric_config["idx"]         # <---------------------|
    metric_kwargs = metric_config["kwargs"]   # <---------------------| Parameters `idx` and `kwargs`
    metric_cls = MetricRegistry[metric_idx]
    return partial(metric_cls, **metric_kwargs)

to prepare the metrics so that they provide a unified interface. MetricRegistry is a dictionary holding the default constructors for the registered metrics. Here idx (which could also be called metric_fn_id) is used to look up the constructor in the registry.
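
A self-contained toy version of this registry pattern, with an illustrative metric class and name that are not from the project:

```python
from functools import partial

# Illustrative metric class; stands in for a real registered metric.
class ScaledError:
    def __init__(self, y_true, y_pred, scale=1.0):
        self.value = scale * abs(y_true - y_pred)

# Registry mapping an idx key to the metric's constructor.
MetricRegistry = {"scaled_error": ScaledError}

def prepare_metric(metric_config: dict) -> partial:
    # Look up the constructor by `idx` and pre-bind `kwargs`,
    # so every metric exposes the same (y_true, y_pred) call signature.
    metric_cls = MetricRegistry[metric_config["idx"]]
    return partial(metric_cls, **metric_config["kwargs"])

metric = prepare_metric({"idx": "scaled_error", "kwargs": {"scale": 2.0}})
print(metric(3.0, 1.0).value)  # → 4.0
```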

dberenbaum commented 11 months ago

And do you want to run all matrix combinations in one run or only select combinations at a time?

kosmitive commented 11 months ago

The metrics should be managed by dvc. The reason is that if there are changes to the plotting stage, the results should be reused (as it takes up to 5 min to evaluate a metric). Imagine adding a new metric: in that case the other metrics shouldn't be recalculated.

dberenbaum commented 11 months ago

What about at the experiment level? Do you run all experiments in a single run, or do you want to invoke only experiment_a at a time?

kosmitive commented 11 months ago

All experiments. The experiment level is a good point; currently we use the pattern from https://github.com/iterative/dvc/issues/9948#issuecomment-1728188440.

Regarding integrating it completely into dispatch_experiment.py: dvc creates a folder experiment_a/metrics. The first run executes only metric_1; the second run executes metric_1 and metric_2. One run is one call of dvc exp run. As dvc removes the files in experiment_a/metrics, we can't rely on existing data in that folder. What do you think would be the best solution at the experiment level (with caching and reuse of values)?