Closed dberenbaum closed 1 year ago
Added a commit to make importance.png
a regular stage output and mark it as a plot in the top-level plots
section instead. Not sure what is better to recommend here. It's more repetitive to have to put the path in both stages
and plots
, but it simplifies dvc stage add
since all plots output may be marked as regular outputs, and it keeps all plots together in one place.
Ready to review, but wait for https://github.com/iterative/dvc.org/pull/3691 and next DVC release to push to example-get-started.
Added a commit to make importance.png a regular stage output and mark it as a plot in the top-level plots section instead. Not sure what is better to recommend here. It's more repetitive to have to put the path in both stages and plots, but it simplifies dvc stage add since all plots output may be marked as regular outputs, and it keeps all plots together in one place.
Tbh, not sure this is that much justified here. Duplication to my mind is not needed here andstage add
is not that much simplified. Or do we plan to deprecate plots
arguments for the dvc stage add
?
Tbh, not sure this is that much justified here. Duplication to my mind is not needed here and
stage add
is not that much simplified. Or do we plan to deprecateplots
arguments for thedvc stage add
?
In this PR, I was focused on conceptually separating ideas in the tutorials. Here are the options:
dvc stage add
without plots (current PR):dvc stage add
without introducing plots or visualization:dvc stage add
...
--outs evaluation/importance.png \
--outs-no-cache evaluation/train/plots \
--outs-no-cache evaluation/test/plots \
...
dvc.yaml
to introduce plots/visualization:plots:
evaluation/importance.png:
ROC:
x: fpr
y:
evaluation/train/plots/roc.json: tpr
evaluation/test/plots/roc.json: tpr
Precision-Recall:
x: recall
y:
evaluation/train/plots/precision_recall.json: precision
evaluation/test/plots/precision_recall.json: precision
Confusion-Matrix:
template: confusion
x: actual
y:
evaluation/train/plots/confusion_matrix.json: predicted
evaluation/test/plots/confusion_matrix.json: predicted
This would enable leaving plots out of https://dvc.org/doc/start/data-management/metrics-parameters-plots and introducing them separately (not that we have to).
dvc stage add
with plotsdvc stage add
with some plots:dvc stage add
...
--plots evaluation/importance.png \
--outs-no-cache evaluation/train/plots \
--outs-no-cache evaluation/test/plots \
...
dvc.yaml
to introduce other plots:plots:
ROC:
x: fpr
y:
evaluation/train/plots/roc.json: tpr
evaluation/test/plots/roc.json: tpr
Precision-Recall:
x: recall
y:
evaluation/train/plots/precision_recall.json: precision
evaluation/test/plots/precision_recall.json: precision
Confusion-Matrix:
template: confusion
x: actual
y:
evaluation/train/plots/confusion_matrix.json: predicted
evaluation/test/plots/confusion_matrix.json: predicted
It may be harder to explain here why to do --plots evaluation/importance.png
but separately configure plots for evaluation/train/plots
and evaluation/test/plots
.
@dberenbaum sorry, may be I still don't understand the basics. Is it possible at all to do all the same as you in the global plots
section now as part of the stage, or the whole purpose of the global section is a superset now and pretty much should be the way to go, while pipelines should describe input / outputs (I guess I was asking the same question in the docs PR - it was not clear, and it's not clear if we should deprioritize / deprecate plot specs in within stages).
the whole purpose of the global section is a superset now
Yes, this is correct. Stage outputs must all be paths, but the key for the ROC
plot, for example, can't be a path because it reads data from multiple paths. Therefore, it can only be specified in the global section, which accepts keys that are not paths.
@dberenbaum makes sense, I would go with the first option then - no plots in stage add
. Overall it feels we should be deprecating it. Also worth making and explaining this distinction in docs if we keep both options for now.
so, it all looks good, right? we need to wait though before we launch this so that Studio and VS Code can render this, right?
so, it all looks good, right? we need to wait though before we launch this so that Studio and VS Code can render this, right?
Sure, let's wait before updating get started. I think we should merge https://github.com/iterative/dvc.org/pull/3691 as soon as it's ready.
I think this could be merged now?
Opened https://github.com/iterative/dvc.org/issues/4140. We can try to coordinate to merge the docs and repo changes together.
@shcheklein What do you think? Should we wait for https://github.com/iterative/studio-support/issues/64 and https://github.com/iterative/studio-support/issues/63 before merging?
@dberenbaum let's wait for a bit ...
@dberenbaum should be good to go now I think?
@shcheklein Maybe we can get https://github.com/iterative/studio/pull/4717 merged if it's almost ready?
@dberenbaum sure, let's ask how long will it take. I'm fine to wait for a day or two.
@dberenbaum are we good to go there?
Yup, looks like there was a Studio release today, so nothing is blocking. I'll go ahead, although I am getting some false conflicts in the PR, so I may have to close and reopen.
Need to wait for https://github.com/iterative/dvc/pull/7477. May also need/want to adjust syntax depending on final state of that core PR.
Train metrics/plots look bad and show that the model is overfitting. Adjusting
train.min_samples_split
to fractional float value (min_samples_split: 0.01 to split until <0.01 of data is in the leaf) make the results more typical.