Closed mmeendez8 closed 1 year ago
As far as I understand, you can achieve this with custom templates in DVC. You'll need to write a small Vega JSON spec, or more likely copy the default one and add to it. Once you have it, assign it to your data file with dvc plots modify --template and commit it to Git; both dvc plots show and Studio will then show it like this. You might need to hardcode the field name(s) for the y axes into the template though, so your custom template would be of limited reuse.
@jorgeorpinel do we officially support such scenario in DVC? If yes maybe we should add this to docs.
Yes, you are definitely right @Suor. I was just wondering whether, if this is going to be the main UI for DVC, it might be necessary to find an easier way to achieve this, especially if you intend to move users over from other popular tools such as MLflow that already support it.
I can also add that using custom templates could end up in a lot of boilerplate too... you would have to move the same code over and over between repositories so it might be better to enable this feature here.
Plots in Studio are in an early phase now; we basically show whatever DVC shows. The question we haven't resolved yet is how far we want to move away from DVC, and which things we should add only here as opposed to both here and in DVC. Plots in DVC are also evolving.
Thinking of this, some template stored on Studio side - provided by platform or by user or generated via UI - linked to any CSV or JSON or other datafile is a valid use case on its own.
Plots in Studio are in an early phase now; we basically show whatever DVC shows. The question we haven't resolved yet is how far we want to move away from DVC, and which things we should add only here as opposed to both here and in DVC. Plots in DVC are also evolving.
I see, I was not conscious of this debate.
Thinking of this, some template stored on Studio side - provided by platform or by user or generated via UI - linked to any CSV or JSON or other datafile is a valid use case on its own.
Yes that would be a plausible solution for keeping studio and DVC "synchronized". Maybe it will make things a bit difficult in the future, I am thinking about the difficulty of handling all those automatically generated files for all plausible combinations... Anyway I get your point, it seems a larger discussion is needed here
This seems like a common enough case. Making it easier in dvc also makes sense, i.e. providing a more generic template and some extra dvc plots modify options. What do you think @dberenbaum? Maybe even adding a possibility to use custom keys within plot props to be passed to the template and rendered there. This would enable users to write their own custom templates that are generic and reusable.
Thanks @Suor. I don't think a user can do this in dvc now even with custom templates since only a single series is expected. It has come up before and makes sense as a useful feature.
There are a couple ways I can imagine achieving this within dvc:
dvc plots diff between the two (or more) files. See https://github.com/iterative/dvc/discussions/5808 for a discussion/proposal on that.
Both sound potentially useful. A couple of reasons I'd probably prioritize the first approach:
cc @pared
I don't think a user can do this in dvc now even with custom templates since only a single series is expected
As far as I can see, all the data, i.e. all CSV columns, are passed to vega, so if you hardcode field names there then you can do anything. If the data comes from several files then it's not possible though, since the data file being the plot is part of how even dvc.yaml stores it. So option 2 is way easier to implement, I believe.
We could do both or figure out what makes more sense for users. Do they want to be able to plot across files or within one file, and which is a better UI?
dvc plots diff --no-index model1_roc.tsv rev:model2_roc.tsv.
dvc plots modify -y train -y val (feel free to suggest something different).
@pared has suggested a syntax like dvc plots show -y file.csv -y rev1:file.csv -y rev2:file.csv.
No obvious way to save this type of plot configuration for future use since config is currently tied to a path.
There is one obvious way: make the notion of a plot independent, with a separate entry in dvc.yaml for plots that refers to the data files, templates and props of each plot. This was briefly discussed when we implemented plots initially, but it was easier to implement attaching props to data files; it was also argued that this is more intuitive and closer to how people operate.
No guarantees that the plots config (what if they specify different templates, x-axes, etc.?) or underlying data are compatible.
Same as now. We show an error when trying to plot, both in DVC and in Studio. This is also complicated by the fact that a data file might change over time, i.e. some columns may disappear or be renamed, or props may change, which means that even if props and columns are consistent within a commit they might not be across commits.
Unclear how to diff between revisions work if there are already multiple series on the plot.
We can use facets either by revision or by y. Alternatively use different line styles. But we come into the territory of combinatorial explosion, variability and user preferences here.
Any command line syntax solution which avoids saving to dvc.yaml won't show up in Studio. This means we would need to invent our own UI and store things ourselves, while on the command line people would need to use bash scripts or history to replot things.
Any command line syntax solution which avoids saving to dvc.yaml won't show up in Studio. This means we would need to invent our own UI and store things ourselves, while on the command line people would need to use bash scripts or history to replot things.
Also, that does not seem to make much sense from the DVC perspective. I mean, that's the point of version control: to save things for later use.
The problem here is that, on one hand, we would like DVC commands to provide tight integration with git and revisions, so that we can easily compare some assets (that was the initial driving force behind plots, and hence the behaviour of diffing only files with the same name), and on the other hand we would now like to compare different files from different revisions. The latter concept does not fit well with the former.
make a notion of a plot independent and make a separate entry in dvc.yaml
If we want to satisfy both ideas, that seems to be the only way - maybe we should store just plot configuration and require user to provide data for particular revisions:files when they use plots?
now we would like to compare different files from different revisions
We may be getting ahead of ourselves here. I haven't yet heard of (nor can I think of) a use case where comparing different files from different revisions is actually needed. Doing one or the other may be sufficient.
make a notion of a plot independent and make a separate entry in dvc.yaml
Having a plots section at the top level of dvc.yaml might happen, but I think the keys are still likely to be file paths for now. If we want to fully decouple plots configuration from file paths altogether, I'm not sure exactly how that should look or whether it's worthwhile. It's probably a separate discussion that goes beyond combining metrics.
DVC could add support for both diffing between files and showing multi-column plots within a file.
# dvc.yaml
plots:
  - train_loss.csv:
      x: epoch
      y: loss
  - val_loss.csv:
      x: epoch
      y: loss
dvc plots diff --no-index train_loss.csv val_loss.csv plots a diff just like comparing revisions. --no-index is not an intuitive name, so open to other suggestions even though it would break git consistency.
# dvc.yaml
plots:
  - loss.csv:
      template: multiline
      x: epoch
      y:
        - train
        - val
dvc plots show plots both lines on the same plot using https://vega.github.io/vega-lite/docs/repeat.html.
dvc plots diff makes a facet grid of the plots (similar to confusion matrix).
So I guess we are discussing versatility vs user experience here. We move targeting data from file_name to column_name. The question is whether there will come a time when someone wants to compare val_loss with train_loss. Then we will be back to discussing a very generic approach, which is now not just dvc plot diff revision:file_path revision2:file_path but even dvc plot revision:file_path:column revision2:file_path2:column2.
For the record one more user was asking about this feature:
i want to plot loss and val_loss data in same graph on dvc studio. how do i command plots modify?
cc @dberenbaum - after the images if we have capacity, let's try to think together if we should improve this on the DVC side first or do a custom wizard (potentially with an ability to save its state back into repo) on the Studio side. I think there were some good suggestions on the DVC end and they didn't look too heavy.
Prioritizing this, since plots are p1 for us at the moment.
For reference, see this proposal from @Suor in https://github.com/iterative/dvc/discussions/5980#discussioncomment-1026072:
plots:
  - train_vs_val:
      x: epoch
      y: loss
      data: [train_loss.csv, val_loss.csv]
  - val_f1.csv:
      x: epoch
      y: [f1_class_0, f1_class_1]
      y_label: f1
      # data is absent, using key value: val_f1.csv
  # Separately plot since we have TWO plots, even though with the same data file
  - scores_acc:
      x: epoch
      y: acc
      data: scores.csv
  - scores_auc:
      x: epoch
      y: auc
      data: scores.csv
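To make the defaulting rules in the proposal's comments concrete, here is a minimal Python sketch. It is not DVC code: `normalize_plots` is a hypothetical helper, and the only rules it applies are the ones stated above (a missing `data` falls back to the entry's key; `y` and `data` may each be a single value or a list).

```python
# Hypothetical helper, NOT part of DVC. It only mirrors the rules written
# in the proposal's comments.
def normalize_plots(plots):
    """Expand each proposed plot entry into explicit 'data' and 'y' lists."""
    resolved = {}
    for entry in plots:
        for name, props in entry.items():
            props = dict(props or {})
            # "data is absent, using key value": fall back to the entry key.
            data = props.get("data", name)
            y = props.get("y", [])
            props["data"] = [data] if isinstance(data, str) else list(data)
            props["y"] = [y] if isinstance(y, str) else list(y)
            resolved[name] = props
    return resolved


# The proposal above, written as plain Python data to avoid a YAML dependency.
plots = [
    {"train_vs_val": {"x": "epoch", "y": "loss",
                      "data": ["train_loss.csv", "val_loss.csv"]}},
    {"val_f1.csv": {"x": "epoch", "y": ["f1_class_0", "f1_class_1"],
                    "y_label": "f1"}},
    {"scores_acc": {"x": "epoch", "y": "acc", "data": "scores.csv"}},
    {"scores_auc": {"x": "epoch", "y": "auc", "data": "scores.csv"}},
]

resolved = normalize_plots(plots)
for name, props in resolved.items():
    print(name, props["data"], props["y"])
```

With this shape, two plots can share one data file (scores_acc and scores_auc) and one plot can span several files (train_vs_val), which is the decoupling discussed earlier in the thread.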
For the record, we got one more request for this:
https://discord.com/channels/485586884165107732/841856466897469441/892320977323712553
Brief summary, read the whole post for the details:
Hi everyone! I searched a little but couldn't find anything so wanted to ask. I want to create a custom vega template to see multiple lines in a single graph but it seems that studio doesn't allow it. What I want to do is see both training and validation loss on a single graph. I've created a custom template and it works both in the vega online editor and with dvc plots show. But studio appends two keys to the template and it messes up the graph.
Hi everyone! My team is also used to seeing the training and test losses in the same graph for each iteration (or epoch). I did some tests with current releases, and I believe the problem is with the studio, because running dvc plots show properly shows the plots in the local browser. Here is what I've done so far.
This is my custom template file:
multi_loss.json:
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {
    "values": "<DVC_METRIC_DATA>"
  },
  "title": "<DVC_METRIC_TITLE>",
  "width": 300,
  "height": 300,
  "mark": {
    "type": "line",
    "point": {
      "filled": false,
      "fill": "white"
    }
  },
  "encoding": {
    "x": {
      "field": "<DVC_METRIC_X>",
      "type": "quantitative",
      "title": "<DVC_METRIC_X_LABEL>"
    },
    "y": {
      "field": "<DVC_METRIC_Y>",
      "type": "quantitative",
      "title": "<DVC_METRIC_Y_LABEL>",
      "scale": {
        "zero": false
      }
    },
    "color": {
      "field": "stage",
      "type": "nominal",
      "legend": {"disable": false},
      "scale": {}
    }
  }
}
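For anyone who wants to preview such a template outside DVC, the following Python sketch substitutes the placeholders that appear in the template above and prints a spec that can be pasted into the Vega editor. It is an illustration only and not how DVC itself fills templates; `fill` is a hypothetical helper, and the template and CSV here are shortened copies of the ones in this comment.

```python
import csv
import io
import json

# Shortened copy of the template above (same placeholders, fewer options).
TEMPLATE = {
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {"values": "<DVC_METRIC_DATA>"},
    "title": "<DVC_METRIC_TITLE>",
    "mark": {"type": "line"},
    "encoding": {
        "x": {"field": "<DVC_METRIC_X>", "type": "quantitative",
              "title": "<DVC_METRIC_X_LABEL>"},
        "y": {"field": "<DVC_METRIC_Y>", "type": "quantitative",
              "title": "<DVC_METRIC_Y_LABEL>"},
        "color": {"field": "stage", "type": "nominal"},
    },
}


def fill(node, anchors):
    """Recursively replace any string that exactly matches a placeholder."""
    if isinstance(node, dict):
        return {key: fill(value, anchors) for key, value in node.items()}
    if isinstance(node, list):
        return [fill(value, anchors) for value in node]
    if isinstance(node, str) and node in anchors:
        return anchors[node]
    return node


# Two rows in the same layout as the sample CSV (stage, epoch, loss).
CSV_TEXT = "stage,epoch,loss\ntrain,1,4.7\ntest,1,14.7\n"
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

spec = fill(TEMPLATE, {
    "<DVC_METRIC_DATA>": rows,
    "<DVC_METRIC_TITLE>": "Train/Test losses",
    "<DVC_METRIC_X>": "epoch",
    "<DVC_METRIC_Y>": "loss",
    "<DVC_METRIC_X_LABEL>": "epoch",
    "<DVC_METRIC_Y_LABEL>": "loss",
})
print(json.dumps(spec, indent=2))
```

Note that the color field stays hardcoded to stage, which is the part of the template that is not reusable across projects.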
This is my plot definition in dvc.yaml:
- plots/losses.csv:
    cache: false
    title: Train/Test losses
    template: multi_loss
    x: epoch
    y: loss
And this is my sample csv file:
stage,epoch,loss
train,1,4.7
train,2,3.5
train,3,2.2
train,4,2.1
train,5,1.1
train,6,1.0
train,7,0.4
test,1,14.7
test,2,13.5
test,3,12.2
test,4,12.1
test,5,11.1
test,6,11.0
test,7,8.4
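If the training code logs one column per series instead (for example epoch, train, test, as in the multi-column proposal earlier in the thread), the long layout above can be produced with a few lines of standard-library Python. This is a sketch under that assumption; `wide_to_long` and the wide column names are hypothetical.

```python
import csv
import io

# Assumed wide layout: one row per epoch, one column per series.
WIDE = "epoch,train,test\n1,4.7,14.7\n2,3.5,13.5\n"


def wide_to_long(text, x="epoch", value_name="loss", series_name="stage"):
    """Turn one-column-per-series CSV text into stage,epoch,loss rows."""
    reader = csv.DictReader(io.StringIO(text))
    series = [column for column in reader.fieldnames if column != x]
    rows = list(reader)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([series_name, x, value_name])
    for name in series:  # one block of rows per series, as in the sample
        for row in rows:
            writer.writerow([name, row[x], row[name]])
    return out.getvalue()


long_csv = wide_to_long(WIDE)
print(long_csv)
```

The long layout is what lets a single y field (loss) plus a color field (stage) draw both lines from one file.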
This configuration shows the graph properly both in the dvc plots show output and in the Vega editor:
However, the problem is, the studio wants to group the plots by revision and overrides two keys in the template. Here is what vega editor shows when I click on "Open in Vega Editor" from the studio:
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {
    "values": [
      {"loss": "4.7", "epoch": "1", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "3.5", "epoch": "2", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "2.2", "epoch": "3", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "2.1", "epoch": "4", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "1.1", "epoch": "5", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "1.0", "epoch": "6", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "0.4", "epoch": "7", "stage": "train", "rev": "ab8f6b3"},
      {"loss": "14.7", "epoch": "1", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "13.5", "epoch": "2", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "12.2", "epoch": "3", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "12.1", "epoch": "4", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "11.1", "epoch": "5", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "11.0", "epoch": "6", "stage": "test", "rev": "ab8f6b3"},
      {"loss": "8.4", "epoch": "7", "stage": "test", "rev": "ab8f6b3"}
    ]
  },
  "title": "Train/Test losses",
  "width": "container",
  "height": 200,
  "mark": {"type": "line", "point": {"filled": false, "fill": "white"}},
  "encoding": {
    "x": {"field": "epoch", "type": "quantitative", "title": "epoch"},
    "y": {
      "field": "loss",
      "type": "quantitative",
      "title": "loss",
      "scale": {"zero": false}
    },
    "color": {
      "field": "stage",
      "type": "nominal",
      "legend": {"disable": true},
      "scale": {"domain": ["ab8f6b3"], "range": ["#13adc7"]}
    }
  },
  "padding": {"bottom": 5, "left": 5, "right": 5, "top": 5}
}
The studio plot is:
It overrides the legend and scale keys in the color section, and because of that vega shows only ${stage} == "ab8f6b3". If I change the stage of some rows to ab8f6b3, vega plots those rows.
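The mismatch can be checked mechanically. The sketch below makes no claim about how Vega renders out-of-domain values; it only lists which values of the color field are absent from the color scale domain that Studio injected. `hidden_color_values` is a hypothetical helper, and the spec is a trimmed copy of the one above.

```python
def hidden_color_values(spec):
    """Values of the color field that are missing from the color scale domain."""
    color = spec["encoding"]["color"]
    domain = color.get("scale", {}).get("domain")
    if domain is None:  # no explicit domain, nothing can be excluded by it
        return set()
    present = {row[color["field"]] for row in spec["data"]["values"]}
    return present - set(domain)


# Trimmed copy of the spec exported from Studio (two rows are enough).
studio_spec = {
    "data": {"values": [
        {"loss": "4.7", "epoch": "1", "stage": "train", "rev": "ab8f6b3"},
        {"loss": "14.7", "epoch": "1", "stage": "test", "rev": "ab8f6b3"},
    ]},
    "encoding": {
        "color": {
            "field": "stage",
            "type": "nominal",
            "legend": {"disable": True},
            "scale": {"domain": ["ab8f6b3"], "range": ["#13adc7"]},
        }
    },
}

missing = hidden_color_values(studio_spec)
print(sorted(missing))  # ['test', 'train']
```

Both stage values fall outside the domain, which matches the observation that rows only show up once their stage is changed to ab8f6b3.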
I think we need a way to tell the studio that I don't want to compare these plots between revisions.
@cagdasbas Thanks for the detailed info! This is a nice workaround for getting training and validation onto the same plot. We hope to make this easier than needing a custom template in the future, but glad dvc plots show is at least working for you.
If you need a quick fix, I think you could adjust your template to add a facet, like:
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {
    "values": "<DVC_METRIC_DATA>"
  },
  "title": "<DVC_METRIC_TITLE>",
  "facet": {
    "field": "rev",
    "type": "nominal"
  },
  "spec": {
    "width": 300,
    "height": 300,
    "mark": {
      "type": "line",
      "point": {
        "filled": false,
        "fill": "white"
      }
    },
    "encoding": {
      "x": {
        "field": "<DVC_METRIC_X>",
        "type": "quantitative",
        "title": "<DVC_METRIC_X_LABEL>"
      },
      "y": {
        "field": "<DVC_METRIC_Y>",
        "type": "quantitative",
        "title": "<DVC_METRIC_Y_LABEL>",
        "scale": {
          "zero": false
        }
      },
      "color": {
        "field": "stage",
        "type": "nominal",
        "legend": {"disable": false},
        "scale": {}
      }
    }
  }
}
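The same change can be applied to an existing single-view template programmatically. This Python sketch moves the view-level keys into a nested spec and adds a facet on rev, mirroring the hand-written template above; `facet_by` and `VIEW_KEYS` are hypothetical names, and the key list covers only the keys used in this thread's template.

```python
import copy
import json

# View-level keys used by the template in this thread; a fuller template
# may need more keys moved into the inner spec.
VIEW_KEYS = ("width", "height", "mark", "encoding")


def facet_by(spec, field="rev"):
    """Return a copy of a single-view spec wrapped in a facet on `field`."""
    spec = copy.deepcopy(spec)  # leave the caller's template untouched
    inner = {key: spec.pop(key) for key in VIEW_KEYS if key in spec}
    spec["facet"] = {"field": field, "type": "nominal"}
    spec["spec"] = inner
    return spec


single = {
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {"values": "<DVC_METRIC_DATA>"},
    "title": "<DVC_METRIC_TITLE>",
    "width": 300,
    "height": 300,
    "mark": {"type": "line"},
    "encoding": {
        "x": {"field": "<DVC_METRIC_X>", "type": "quantitative"},
        "y": {"field": "<DVC_METRIC_Y>", "type": "quantitative"},
        "color": {"field": "stage", "type": "nominal"},
    },
}

faceted = facet_by(single)
print(json.dumps(faceted, indent=2))
```

Each revision then gets its own panel, while color remains free to distinguish train from test within a panel.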
@jorgeorpinel do we officially support such scenario in DVC? If yes maybe we should add this to docs.
all the data, i.e. all CSV columns, are passed to vega, so if you hardcode field names there then you can do anything
Sorry for the very late reply on that, but I also have the impression it's currently possible with a custom template. Can you confirm, @pared? If so, it would definitely be nice to have an advanced example in https://dvc.org/doc/command-reference/plots if you guys want to contribute a draft! Probably not essential though, especially as this discussion is ongoing and there may be a better way in the near future.
@ssachkovskaya probably switching off color rewriting if the field there is not rev should work here. And probably a good idea overall. I.e. we don't mess with a template unless it is what we expect.
@cagdasbas we have pushed a fix to Studio, so now you should be able to see plots with multiple metrics using the approach suggested by @dberenbaum (your workaround + facet). Hope this helps while we are working on this feature.
@Suor good suggestion, however I am not sure how we will merge plots from multiple selected commits if they don't have a rev field. Let's discuss it internally.
Thanks everyone! I couldn't try the fix @dberenbaum suggested. I'll write back as soon as I can.
@jorgeorpinel It seems to be possible, though I think that this won't be an issue after iterative/dvc#5980
One more request from a user:
I really enjoy using dvc, and for me, there is one thing that might improve its notoriety and its popularity in the community, and I really really want to know if it is already inside or in question as a dev improvement on the stack : You might wonder what is all about ! Take a look at this picture, it is a really common picture in ml, but not in dvc nor dvc studio, I guess. Will DVC plots command accept two columns against formally precision of labels and/or title ? Perhaps it is already possible, but not in the doc, I guess. Please leave me a comment.
https://discordapp.com/channels/485586884165107732/563406153334128681/927839356654346262
@dberenbaum @pared are there plans to implement the proposal?
Yes, perfect timing since @pared and I were just discussing it when this user posted! @pared is starting work on it now and will share the plans with @Suor and @shcheklein soon. Let us know if there's anyone else to include for initial feedback.
@dberenbaum @pared could you also include me in the plan/discussion for this. Thanks.
@tapadipti I am currently working on that, what would you like to know?
Update. It has been implemented on the DVC side and we are looking now into this on the Studio side.
For docs, please see the "Combine multiple data sources" section here: https://dvc.org/doc/user-guide/visualizing-plots
@dberenbaum @shcheklein @tapadipti Since now top-level plots are supported by Studio, can this issue be closed?
It is always nice to be able to combine different plots from the same experiment. The classical example would be to compare train and validation loss or the relation between learning rate schedule changes and training loss