Add support for combining metrics

mmeendez8 commented 3 years ago

It is always nice to be able to combine different plots from the same experiment. The classic example would be comparing training and validation loss, or relating learning rate schedule changes to training loss.

Suor commented 3 years ago

As far as I understand, you can achieve this with custom templates in DVC. You'll need to write a little Vega JSON, or more likely copy the default template and add to it. Once you have it, assign it to your data file with dvc plots modify --template and commit it to Git; both dvc plots show and Studio will then show it like this. You might need to hardcode the field name(s) for the y axes into the template though, so your custom template would be of limited reuse.

Suor commented 3 years ago

@jorgeorpinel do we officially support such a scenario in DVC? If yes, maybe we should add this to the docs.

mmeendez8 commented 3 years ago

Yes, you are definitely right @Suor. I was just wondering whether, if this is going to be the main UI for DVC, it might be necessary to find an easier way to achieve this, especially if you intend to move users over from other popular tools such as MLflow that allow it.

I can also add that using custom templates could end up in a lot of boilerplate too... you would have to move the same code over and over between repositories so it might be better to enable this feature here.

Suor commented 3 years ago

Plots in Studio are in an early phase now; we basically show whatever DVC shows. That is a question we haven't resolved yet: how far do we want to move away from DVC, and what types of things should we add here as opposed to both here and in DVC? Plots in DVC are also evolving.

Suor commented 3 years ago

Thinking of this, some template stored on Studio side - provided by platform or by user or generated via UI - linked to any CSV or JSON or other datafile is a valid use case on its own.

mmeendez8 commented 3 years ago

Plots in Studio are in an early phase now; we basically show whatever DVC shows. That is a question we haven't resolved yet: how far do we want to move away from DVC, and what types of things should we add here as opposed to both here and in DVC? Plots in DVC are also evolving.

I see, I was not aware of this debate.

Thinking of this, some template stored on Studio side - provided by platform or by user or generated via UI - linked to any CSV or JSON or other datafile is a valid use case on its own.

Yes, that would be a plausible solution for keeping Studio and DVC "synchronized". Maybe it will make things a bit difficult in the future; I am thinking about the difficulty of handling all those automatically generated files for all plausible combinations... Anyway, I get your point. It seems a larger discussion is needed here.

Suor commented 3 years ago

This seems like a common enough case. Making it easier in dvc also makes sense, i.e. providing a more generic template and some extra dvc plots modify options. What do you think @dberenbaum? Maybe even adding a possibility to use custom keys within plot props to be passed to template and rendered there - this is to enable users writing their own custom templates, which are generic/reusable.

dberenbaum commented 3 years ago

Thanks @Suor. I don't think a user can do this in dvc now even with custom templates since only a single series is expected. It has come up before and makes sense as a useful feature.

There are a couple ways I can imagine achieving this within dvc:

Have training and validation loss in separate files and allow dvc plots diff between the two (or more) files. See https://github.com/iterative/dvc/discussions/5808 for a discussion/proposal on that.
Have training and validation loss in the same file, support more than one y-axis field, and add a template for multi-series plots.

Both sound potentially useful. A couple of reasons I'd probably prioritize the first approach:

It seems easier to implement quickly since it's just adding an option to diff between file paths instead of revisions.
DVCLive is currently set up to have a single series per plots file and to separate training and validation into different paths.
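To make the difference between the two layouts concrete, here is a minimal sketch of turning separate per-split files into a single long-format file. The column names (epoch, loss) and the extra stage column are assumptions chosen for illustration; this is not something DVC or DVCLive does for you.

```python
import csv
import io


def merge_splits(split_csvs):
    """Merge per-split CSV texts into one long-format CSV.

    `split_csvs` maps a label such as "train" to CSV text that has
    `epoch` and `loss` columns (assumed names). The label is written
    into a new `stage` column so one file can carry several series.
    """
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["stage", "epoch", "loss"], lineterminator="\n"
    )
    writer.writeheader()
    for stage, text in split_csvs.items():
        for row in csv.DictReader(io.StringIO(text)):
            writer.writerow(
                {"stage": stage, "epoch": row["epoch"], "loss": row["loss"]}
            )
    return out.getvalue()


train = "epoch,loss\n1,4.7\n2,3.5\n"
val = "epoch,loss\n1,14.7\n2,13.5\n"
merged = merge_splits({"train": train, "val": val})
print(merged)
```

The trade-off is that a merged file is easy to plot as several series, but it has to be regenerated whenever either source file changes.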

cc @pared

Suor commented 3 years ago

I don't think a user can do this in dvc now even with custom templates since only a single series is expected

As far as I can see all the data, i.e. all CSV columns, are passed to vega, so if you hardcode field names there then you can do anything. If the data comes from several files then it's not possible though, since the data file being the plot is part of how even dvc.yaml stores it. So option 2 is way easier to implement, I believe.
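As a rough illustration of the "hardcode field names" idea, the sketch below builds a Vega-Lite template whose fold transform turns hardcoded columns into one series each. The column names train and val are assumptions, as are the series and value field names; the <DVC_METRIC_...> anchors are the ones used in the custom template posted later in this thread. Treat it as a sketch, not a template shipped with DVC.

```python
import json


def multi_series_template(y_fields):
    """Build a Vega-Lite spec that plots each hardcoded column as a line.

    The fold transform reshapes the listed columns into key/value
    pairs ("series"/"value"), so one color-encoded line mark can draw
    all of them.
    """
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        "data": {"values": "<DVC_METRIC_DATA>"},
        "title": "<DVC_METRIC_TITLE>",
        "transform": [{"fold": list(y_fields), "as": ["series", "value"]}],
        "mark": "line",
        "encoding": {
            "x": {
                "field": "<DVC_METRIC_X>",
                "type": "quantitative",
                "title": "<DVC_METRIC_X_LABEL>",
            },
            "y": {"field": "value", "type": "quantitative"},
            "color": {"field": "series", "type": "nominal"},
        },
    }


template = multi_series_template(["train", "val"])
print(json.dumps(template, indent=4))
```

Because the y columns are baked into the template, it only works for files that have exactly those columns, which is the limited-reuse concern raised earlier in the thread.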

dberenbaum commented 3 years ago

We could do both or figure out what makes more sense for users. Do they want to be able to plot across files or within one file, and which is a better UI?

Different files

UI might look like dvc plots diff --no-index model1_roc.tsv rev:model2_roc.tsv.
Natural for something like training and validation data that might be stored separately.
No obvious way to save this type of plot configuration for future use since config is currently tied to a path.
No guarantees that the plots config (what if they specify different templates, x-axes, etc.?) or underlying data are compatible.

Same file

UI might look like dvc plots modify -y train -y val (feel free to suggest something different).
Natural for something like multiclass roc plots that would be stored in one dataset.
Unclear how diffing between revisions would work if there are already multiple series on the plot.

Combined approach

@pared has suggested a syntax like dvc plots show -y file.csv -y rev1:file.csv -y rev2:file.csv.

Covers both scenarios.
Need to verify how this works (I think the column names are missing unless I'm misunderstanding).
Like comparing different files, it's unclear if there's a way to save this type of plot config.
Might add complexity for simple scenarios.

Suor commented 3 years ago

No obvious way to save this type of plot configuration for future use since config is currently tied to a path.

There is one obvious way - make a notion of a plot independent and make a separate entry in dvc.yaml for plots, referring to data files, templates and props there for each plot. This was briefly discussed when we implemented plots initially, but it was easier to implement attaching props to data files; it was also argued that this approach is more intuitive and closer to how people operate.

No guarantees that the plots config (what if they specify different templates, x-axes, etc.?) or underlying data are compatible.

Same as now. We show an error when trying to plot, both in DVC and Studio. This is also complicated by the fact that the data file might change over time, i.e. some columns may disappear or be renamed, or props may change, which also means that even if props and columns are consistent within a commit they might not be across commits.

Unclear how diffing between revisions would work if there are already multiple series on the plot.

We can use facets either by revision or by y. Alternatively use different line styles. But we come into territory of combinatorial explosions, variability and user preferences here.

Any command line syntax solution, which avoids saving to dvc.yaml, won't show up in Studio. This will mean we would need to invent our own UI and store things ourselves, while in command line people will need to use bash scripts or history to replot things.

pared commented 3 years ago

Any command line syntax solution, which avoids saving to dvc.yaml, won't show up in Studio. This will mean we would need to invent our own UI and store things ourselves, while in command line people will need to use bash scripts or history to replot things.

Also, that does not seem to make much sense from a DVC perspective. I mean, that's the point of version control: to save things for later use.

The problem here is that, on one hand, we would like DVC commands to provide tight integration with git and revisions, so that we can easily compare some assets (that was the initial driving force behind plots, and hence the behaviour of diffing only files with the same name), and now we would like to compare different files from different revisions. The latter concept does not sit well with the former.

make a notion of a plot independent and make a separate entry in dvc.yaml

If we want to satisfy both ideas, that seems to be the only way - maybe we should store just plot configuration and require user to provide data for particular revisions:files when they use plots?

dberenbaum commented 3 years ago

now we would like to compare different files from different revisions

We may be getting ahead of ourselves here. I haven't yet heard of (nor can I think of) a use case where comparing different files from different revisions is actually needed. Doing one or the other may be sufficient.

make a notion of a plot independent and make a separate entry in dvc.yaml

Having a plots section at the top level of dvc.yaml might happen, but I think the keys are still likely to be file paths for now. If we want to fully decouple plots configuration from file paths altogether, I'm not sure exactly how that should look or whether it's worthwhile. It's probably a separate discussion that goes beyond combining metrics.

DVC could add support for both diffing between files and showing multi-column plots within a file.

Diffing between file paths:

# dvc.yaml
plots:
- train_loss.csv:
    x: epoch
    y: loss
- val_loss.csv:
    x: epoch
    y: loss

dvc plots diff --no-index train_loss.csv val_loss.csv plots a diff just like comparing revisions.
--no-index is not an intuitive name, so open to other suggestions even though it would break git consistency.
Throw an error if configs don't match.
Plotting this in Studio doesn't seem much different to me than existing diff plots.
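The "throw an error if configs don't match" rule could look roughly like the sketch below. The function name and the set of compared keys are hypothetical, picked only to illustrate the check; the actual behaviour in DVC may differ.

```python
def check_compatible(configs, keys=("x", "y", "template")):
    """Raise ValueError if plot configs differ on any compared key.

    `configs` maps a file path to its plot props, mirroring the
    per-file entries of the dvc.yaml snippet above. The first entry
    is used as the reference that the others must match.
    """
    paths = list(configs)
    if not paths:
        return
    reference = configs[paths[0]]
    for path in paths[1:]:
        for key in keys:
            if configs[path].get(key) != reference.get(key):
                raise ValueError(
                    f"{path}: '{key}' is {configs[path].get(key)!r}, "
                    f"but {paths[0]} has {reference.get(key)!r}"
                )


# Matching configs pass silently.
check_compatible(
    {
        "train_loss.csv": {"x": "epoch", "y": "loss"},
        "val_loss.csv": {"x": "epoch", "y": "loss"},
    }
)

# Mismatched x-axes are reported instead of being plotted together.
try:
    check_compatible(
        {
            "train_loss.csv": {"x": "epoch", "y": "loss"},
            "val_loss.csv": {"x": "step", "y": "loss"},
        }
    )
except ValueError as err:
    print(err)
```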

Plotting multiple columns within a file:

# dvc.yaml
plots:
- loss.csv:
    template: multiline
    x: epoch
    y:
    - train
    - val

dvc plots show plots both lines on the same plot using https://vega.github.io/vega-lite/docs/repeat.html.
dvc plots diff makes a facet grid of the plots (similar to confusion matrix).

pared commented 3 years ago

So I guess we are discussing versatility vs user experience here. We move from targeting data by file_name to targeting it by column_name. The question is whether there will come a time when someone wants to compare val_loss with train_loss. Then we will be back to discussing a very generic approach, which is now not even dvc plots diff revision:file_path revision2:file_path but dvc plots diff revision:file_path:column revision2:file_path2:column2.
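To show what such a generic target would involve, here is a small sketch that parses the revision:file_path:column form. This syntax is only being floated in the discussion, not an existing DVC option, and PlotTarget and parse_target are hypothetical names.

```python
from typing import NamedTuple, Optional


class PlotTarget(NamedTuple):
    rev: Optional[str]
    path: str
    column: Optional[str]


def parse_target(target):
    """Split 'rev:path:column' into its parts.

    Accepted forms (an assumption for this sketch): 'path',
    'rev:path', and 'rev:path:column'. Paths containing ':' are
    not handled.
    """
    parts = target.split(":")
    if len(parts) == 1:
        return PlotTarget(None, parts[0], None)
    if len(parts) == 2:
        return PlotTarget(parts[0], parts[1], None)
    if len(parts) == 3:
        return PlotTarget(parts[0], parts[1], parts[2])
    raise ValueError(f"cannot parse plot target: {target!r}")


print(parse_target("HEAD:logs/loss.csv:val_loss"))
```

Note that a two-part target is ambiguous between rev:path and path:column, which is one reason a fully generic syntax is harder to make intuitive.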

shcheklein commented 3 years ago

For the record one more user was asking about this feature:

i want to plot loss and val_loss data in same graph on dvc studio. how do i command plots modify?

cc @dberenbaum - after the images if we have capacity, let's try to think together if we should improve this on the DVC side first or do a custom wizard (potentially with an ability to save its state back into repo) on the Studio side. I think there were some good suggestions on the DVC end and they didn't look too heavy.

Prioritizing this, since plots are p1 for us at the moment.

dberenbaum commented 3 years ago

For reference, see this proposal from @Suor in https://github.com/iterative/dvc/discussions/5980#discussioncomment-1026072:

plots:
- train_vs_val:
    x: epoch
    y: loss
    data: [train_loss.csv, val_loss.csv]

- val_f1.csv:
    x: epoch
    y: [f1_class_0, f1_class_1]
    y_label: f1
    # data is absent, using key value: val_f1.csv

# Separately plot since we have TWO plots, even though with the same data file
- scores_acc:
    x: epoch
    y: acc
    data: scores.csv
- scores_auc:
    x: epoch
    y: auc
    data: scores.csv
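One way to read the rules implied by this proposal (the entry key doubles as the data file when data is absent, and both data and y may be a single value or a list) is sketched below. normalize_plot is a hypothetical helper written for illustration; it is not how DVC parses dvc.yaml.

```python
def normalize_plot(entry):
    """Normalize one plots entry into explicit name/data/x/y fields.

    `entry` is a single-key dict as a YAML loader would produce for
    one list item, e.g. {"val_f1.csv": {"x": "epoch", "y": [...]}}.
    """
    ((name, props),) = entry.items()
    data = props.get("data", name)  # fall back to the key as the file path
    ys = props["y"]
    return {
        "name": name,
        "data": [data] if isinstance(data, str) else list(data),
        "x": props["x"],
        "y": [ys] if isinstance(ys, str) else list(ys),
    }


entries = [
    {"train_vs_val": {"x": "epoch", "y": "loss",
                      "data": ["train_loss.csv", "val_loss.csv"]}},
    {"val_f1.csv": {"x": "epoch", "y": ["f1_class_0", "f1_class_1"]}},
    {"scores_acc": {"x": "epoch", "y": "acc", "data": "scores.csv"}},
]
for entry in entries:
    print(normalize_plot(entry))
```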

shcheklein commented 3 years ago

For the record, we got one more request for this:

https://discord.com/channels/485586884165107732/841856466897469441/892320977323712553

Brief summary, read the whole post for the details:

Hi everyone! I searched a little but couldn't find anything so wanted to ask. I want to create a custom vega template to see multiple lines in a single graph but it seems that studio doesn't allow it. What I want to do is see both training and validation loss on a single graph. I've created a custom template and it both works on vega online editor and dvc plot show shows it properly. But studio appends two key to the template and it messes up the graph.

cagdasbas commented 3 years ago

Hi everyone! My team is also used to seeing the training and test losses in the same graph for each iteration (or epoch). I did some tests with the current releases, and I believe the problem is with Studio, because running dvc plots show properly shows the plots in the local browser. Here is what I've done so far.

This is my custom template file:

multi_loss.json:
{
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {
        "values": "<DVC_METRIC_DATA>"
    },
    "title": "<DVC_METRIC_TITLE>",
    "width": 300,
    "height": 300,
    "mark": {
        "type": "line",
        "point": {
            "filled": false,
            "fill": "white"
        }
    },
    "encoding": {
        "x": {
            "field": "<DVC_METRIC_X>",
            "type": "quantitative",
            "title": "<DVC_METRIC_X_LABEL>"
        },
        "y": {
            "field": "<DVC_METRIC_Y>",
            "type": "quantitative",
            "title": "<DVC_METRIC_Y_LABEL>",
            "scale": {
                "zero": false
            }
        },
        "color": {
            "field": "stage",
            "type": "nominal",
            "legend": {"disable": false},
            "scale": {}
        }
    }
}

This is my plot definition in dvc.yaml

      - plots/losses.csv:
          cache: false
          title: Train/Test losses
          template: multi_loss
          x: epoch
          y: loss

And this is my sample csv file:

stage,epoch,loss
train,1,4.7
train,2,3.5
train,3,2.2
train,4,2.1
train,5,1.1
train,6,1.0
train,7,0.4
test,1,14.7
test,2,13.5
test,3,12.2
test,4,12.1
test,5,11.1
test,6,11.0
test,7,8.4

This configuration shows the graph properly both in the dvc plots show output and in the Vega editor:

However, the problem is, the studio wants to group the plots by revision and overrides two keys in the template. Here is what vega editor shows when I click on "Open in Vega Editor" from the studio:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {
    "values": [
      {"loss": "4.7", "epoch": "1", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "3.5", "epoch": "2", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "2.2", "epoch": "3", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "2.1", "epoch": "4", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "1.1", "epoch": "5", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "1.0", "epoch": "6", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "0.4", "epoch": "7", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "14.7", "epoch": "1", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "13.5", "epoch": "2", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "12.2", "epoch": "3", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "12.1", "epoch": "4", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "11.1", "epoch": "5", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "11.0", "epoch": "6", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "8.4", "epoch": "7", "stage": "test", "rev": "ab8f6b3"}
    ]
  },
  "title": "Train/Test losses",
  "width": "container",
  "height": 200,
  "mark": {"type": "line", "point": {"filled": false, "fill": "white"}},
  "encoding": {
    "x": {"field": "epoch", "type": "quantitative", "title": "epoch"},
    "y": {
      "field": "loss",
      "type": "quantitative",
      "title": "loss",
      "scale": {"zero": false}
    },
    "color": {
      "field": "stage",
      "type": "nominal",
      "legend": {"disable": true},
      "scale": {"domain": ["ab8f6b3"], "range": ["#13adc7"]}
    }
  },
  "padding": {"bottom": 5, "left": 5, "right": 5, "top": 5}
}

The studio plot is:

It overrides the legend and scale keys in the color section, and because of that Vega shows only the rows where stage == "ab8f6b3". If I change the stage of some rows to ab8f6b3, Vega plots those rows.

I think we need a way to tell the studio that I don't want to compare these plots between revisions.
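A quick way to pin down which keys were rewritten is to diff the original template against the spec exported from Studio. The sketch below is a generic recursive dict comparison (diff_specs is a hypothetical helper), applied to the two color sections shown above.

```python
def diff_specs(a, b, path=""):
    """Return the dotted paths at which two nested dicts differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        changed = []
        for key in sorted(set(a) | set(b)):
            sub = f"{path}.{key}" if path else key
            if key not in a or key not in b:
                changed.append(sub)  # key added or removed
            else:
                changed.extend(diff_specs(a[key], b[key], sub))
        return changed
    return [] if a == b else [path]


template_color = {
    "field": "stage",
    "type": "nominal",
    "legend": {"disable": False},
    "scale": {},
}
studio_color = {
    "field": "stage",
    "type": "nominal",
    "legend": {"disable": True},
    "scale": {"domain": ["ab8f6b3"], "range": ["#13adc7"]},
}
overridden = diff_specs(template_color, studio_color, "encoding.color")
print(overridden)
```

The field itself is untouched; only the legend and the scale domain/range differ, which matches the symptom that only values equal to the revision hash are drawn.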

dberenbaum commented 3 years ago

@cagdasbas Thanks for the detailed info! This is a nice workaround for getting training and validation onto the same plot. We hope to make this easier than needing a custom template in the future, but glad dvc plots show is at least working for you.

If you need a quick fix, I think you could adjust your template to add a facet, like:

{
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {
        "values": "<DVC_METRIC_DATA>"
    },
    "title": "<DVC_METRIC_TITLE>",
    "facet": {
        "field": "rev",
        "type": "nominal"
    },
    "spec": {
        "width": 300,
        "height": 300,
        "mark": {
            "type": "line",
            "point": {
                "filled": false,
                "fill": "white"
            }
        },
        "encoding": {
            "x": {
                "field": "<DVC_METRIC_X>",
                "type": "quantitative",
                "title": "<DVC_METRIC_X_LABEL>"
            },
            "y": {
                "field": "<DVC_METRIC_Y>",
                "type": "quantitative",
                "title": "<DVC_METRIC_Y_LABEL>",
                "scale": {
                    "zero": false
                }
            },
            "color": {
                "field": "stage",
                "type": "nominal",
                "legend": {"disable": false},
                "scale": {}
            }
        }
    }
}
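The same change can be applied to an existing single-view template programmatically. The sketch below moves the view-level keys into spec and adds the rev facet, mirroring the JSON above; add_rev_facet and the list of moved keys are assumptions made for this example.

```python
import copy

# Keys that describe a single view and therefore belong inside "spec"
# once the chart is faceted (an assumed list for this sketch).
VIEW_KEYS = ("width", "height", "mark", "encoding")


def add_rev_facet(template, field="rev"):
    """Return a faceted copy of a single-view Vega-Lite template."""
    faceted = {
        key: copy.deepcopy(value)
        for key, value in template.items()
        if key not in VIEW_KEYS
    }
    faceted["facet"] = {"field": field, "type": "nominal"}
    faceted["spec"] = {
        key: copy.deepcopy(template[key]) for key in VIEW_KEYS if key in template
    }
    return faceted


single = {
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {"values": "<DVC_METRIC_DATA>"},
    "title": "<DVC_METRIC_TITLE>",
    "width": 300,
    "height": 300,
    "mark": {"type": "line"},
    "encoding": {"color": {"field": "stage", "type": "nominal"}},
}
faceted = add_rev_facet(single)
print(sorted(faceted))
```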

jorgeorpinel commented 3 years ago

@jorgeorpinel do we officially support such a scenario in DVC? If yes, maybe we should add this to the docs.

all the data, i.e. all CSV columns, are passed to vega, so if you hardcode field names there then you can do anything

Sorry for the very late reply on that, but I also have the impression it's currently possible with a custom template. Can you confirm @pared? If so, it would definitely be nice to have an advanced example in https://dvc.org/doc/command-reference/plots if you guys want to contribute a draft! Probably not essential though, especially as this discussion is ongoing and there may be a better way in the near future.

Suor commented 3 years ago

@ssachkovskaya probably switching off color rewriting if the field there is not rev should work here. And probably a good idea overall. I.e. we don't mess with a template unless it is what we expect.

ssachkovskaya commented 3 years ago

@cagdasbas we have pushed a fix to Studio, so now you should be able to see plots with multiple metrics using the approach suggested by @dberenbaum (your workaround + facet). Hope this helps while we are working on this feature.

@Suor good suggestion, however I am not sure how we will merge plots from multiple selected commits if they don't have a rev field. Let's discuss it internally.

cagdasbas commented 3 years ago

Thanks everyone! I haven't been able to try the fix @dberenbaum suggested yet. I'll write back as soon as I can.

pared commented 3 years ago

@jorgeorpinel It seems to be possible, though I think that this won't be an issue after iterative/dvc#5980

shcheklein commented 2 years ago

One more request from a user:

I really enjoy using dvc, and for me, there is one thing that might improve its notoriety and its popularity in the community, and I really really want to know if it is already inside or in question as a dev improvement on the stack : You might wonder what is all about ! Take a look at this picture, it is a really common picture in ml, but not in dvc nor dvc studio, I guess. Will DVC plots command accept two columns against formally precision of labels and/or title ? Perhaps it is already possible, but not in the doc, I guess. Please leave me a comment.

https://discordapp.com/channels/485586884165107732/563406153334128681/927839356654346262

@dberenbaum @pared are there plans to implement the proposal?

dberenbaum commented 2 years ago

Yes, perfect timing since @pared and I were just discussing it when this user posted! @pared is starting work on it now and will share the plans with @Suor and @shcheklein soon. Let us know if there's anyone else to include for initial feedback.

tapadipti commented 2 years ago

@dberenbaum @pared could you also include me in the plan/discussion for this. Thanks.

pared commented 2 years ago

@tapadipti I am currently working on that, what would you like to know?

shcheklein commented 2 years ago

Update. It has been implemented on the DVC side and we are looking now into this on the Studio side.

For docs, please see the "Combine multiple data sources" section here: https://dvc.org/doc/user-guide/visualizing-plots

ssachkovskaya commented 1 year ago

@dberenbaum @shcheklein @tapadipti Since now top-level plots are supported by Studio, can this issue be closed?
