iterative / example-repos-dev

Source code and generator scripts for example DVC projects
https://dvc.org/doc
21 stars 13 forks source link

Add flexible plots #117

Closed dberenbaum closed 1 year ago

dberenbaum commented 2 years ago

Need to wait for https://github.com/iterative/dvc/pull/7477. May also need/want to adjust syntax depending on final state of that core PR.

Train metrics/plots look bad and show that the model is overfitting. Adjusting train.min_samples_split to fractional float value (min_samples_split: 0.01 to split until <0.01 of data is in the leaf) make the results more typical.

dberenbaum commented 2 years ago

Added a commit to make importance.png a regular stage output and mark it as a plot in the top-level plots section instead. Not sure what is better to recommend here. It's more repetitive to have to put the path in both stages and plots, but it simplifies dvc stage add since all plots output may be marked as regular outputs, and it keeps all plots together in one place.

dberenbaum commented 2 years ago

Ready to review, but wait for https://github.com/iterative/dvc.org/pull/3691 and next DVC release to push to example-get-started.

shcheklein commented 2 years ago

Added a commit to make importance.png a regular stage output and mark it as a plot in the top-level plots section instead. Not sure what is better to recommend here. It's more repetitive to have to put the path in both stages and plots, but it simplifies dvc stage add since all plots output may be marked as regular outputs, and it keeps all plots together in one place.

Tbh, not sure this is that much justified here. Duplication to my mind is not needed here andstage add is not that much simplified. Or do we plan to deprecate plots arguments for the dvc stage add?

dberenbaum commented 2 years ago

Tbh, not sure this is that much justified here. Duplication to my mind is not needed here andstage add is not that much simplified. Or do we plan to deprecate plots arguments for the dvc stage add?

In this PR, I was focused on conceptually separating ideas in the tutorials. Here are the options:

dvc stage add without plots (current PR):

  1. Do dvc stage add without introducing plots or visualization:
dvc stage add
  ...
  --outs evaluation/importance.png \
  --outs-no-cache evaluation/train/plots \
  --outs-no-cache evaluation/test/plots \
  ...
  1. Add the following to dvc.yaml to introduce plots/visualization:
plots:
  evaluation/importance.png:
  ROC:
    x: fpr
    y:
      evaluation/train/plots/roc.json: tpr
      evaluation/test/plots/roc.json: tpr
  Precision-Recall:
    x: recall
    y:
      evaluation/train/plots/precision_recall.json: precision
      evaluation/test/plots/precision_recall.json: precision
  Confusion-Matrix:
    template: confusion
    x: actual
    y:
      evaluation/train/plots/confusion_matrix.json: predicted
      evaluation/test/plots/confusion_matrix.json: predicted

This would enable leaving plots out of https://dvc.org/doc/start/data-management/metrics-parameters-plots and introducing them separately (not that we have to).

dvc stage add with plots

  1. Do dvc stage add with some plots:
dvc stage add
  ...
  --plots evaluation/importance.png \
  --outs-no-cache evaluation/train/plots \
  --outs-no-cache evaluation/test/plots \
  ...
  1. Add the following to dvc.yaml to introduce other plots:
plots:
  ROC:
    x: fpr
    y:
      evaluation/train/plots/roc.json: tpr
      evaluation/test/plots/roc.json: tpr
  Precision-Recall:
    x: recall
    y:
      evaluation/train/plots/precision_recall.json: precision
      evaluation/test/plots/precision_recall.json: precision
  Confusion-Matrix:
    template: confusion
    x: actual
    y:
      evaluation/train/plots/confusion_matrix.json: predicted
      evaluation/test/plots/confusion_matrix.json: predicted

It may be harder to explain here why to do --plots evaluation/importance.png but separately configure plots for evaluation/train/plots and evaluation/test/plots.

shcheklein commented 2 years ago

@dberenbaum sorry, may be I still don't understand the basics. Is it possible at all to do all the same as you in the global plots section now as part of the stage, or the whole purpose of the global section is a superset now and pretty much should be the way to go, while pipelines should describe input / outputs (I guess I was asking the same question in the docs PR - it was not clear, and it's not clear if we should deprioritize / deprecate plot specs in within stages).

dberenbaum commented 2 years ago

the whole purpose of the global section is a superset now

Yes, this is correct. Stage outputs must all be paths, but the key for the ROC plot, for example, can't be a path because it reads data from multiple paths. Therefore, it can only be specified in the global section, which accepts keys that are not paths.

shcheklein commented 2 years ago

@dberenbaum makes sense, I would go with the first option then - no plots in stage add. Overall it feels we should be deprecating it. Also worth making and explaining this distinction in docs if we keep both options for now.

shcheklein commented 2 years ago

so, it all looks good, right? we need to wait though before we launch this so that Studio and VS Code can render this, right?

dberenbaum commented 2 years ago

so, it all looks good, right? we need to wait though before we launch this so that Studio and VS Code can render this, right?

Sure, let's wait before updating get started. I think we should merge https://github.com/iterative/dvc.org/pull/3691 as soon as it's ready.

daavoo commented 1 year ago

I think this could be merged now?

dberenbaum commented 1 year ago

Opened https://github.com/iterative/dvc.org/issues/4140. We can try to coordinate to merge the docs and repo changes together.

dberenbaum commented 1 year ago

@shcheklein What do you think? Should we wait for https://github.com/iterative/studio-support/issues/64 and https://github.com/iterative/studio-support/issues/63 before merging?

shcheklein commented 1 year ago

@dberenbaum let's wait for a bit ...

shcheklein commented 1 year ago

@dberenbaum should be good to go now I think?

dberenbaum commented 1 year ago

@shcheklein Maybe we can get https://github.com/iterative/studio/pull/4717 merged if it's almost ready?

shcheklein commented 1 year ago

@dberenbaum sure, let's ask how long will it take. I'm fine to wait for a day or two.

shcheklein commented 1 year ago

@dberenbaum are we good to go there?

dberenbaum commented 1 year ago

Yup, looks like there was a Studio release today, so nothing is blocking. I'll go ahead, although I am getting some false conflicts in the PR, so I may have to close and reopen.