iterative / dvc.org

📖 DVC website and documentation
https://dvc.org
Apache License 2.0

clarify pipeline stages vs experiments #3630

Closed casperdcl closed 1 year ago

casperdcl commented 2 years ago

Some features often underused/misunderstood/unknown could be helped by better docs/messaging/onboarding clarity.

Nothing in use-cases/experiment-tracking nor user-guide/experiment-management seems to tell existing dvc repro users why they should bother with/what are the use cases of dvc exp.

It doesn't seem clear to users what's the difference between stage/repro (i.e. pipelines) and exp (i.e. experiments).

jorgeorpinel commented 2 years ago

I think we're still waiting to see if repro is going to be deprecated in an upcoming release.

Rel https://github.com/iterative/dvc/issues/7866#issuecomment-1151842420

jorgeorpinel commented 2 years ago

Nothing in use-cases/experiment-tracking nor user-guide/experiment-management seems to tell existing dvc repro users why they should bother with/what are the use cases of dvc exp.

We do mention exp run vs. repro specifically in several places like https://dvc.org/doc/user-guide/experiment-management/experiments-overview#basic-workflow, https://dvc.org/doc/user-guide/experiment-management/running-experiments#running-the-pipelines, and https://dvc.org/doc/command-reference/exp/run.

casperdcl commented 2 years ago

None of those links make it remotely clear what the difference is.

The closest near-miss to being potentially helpful is:

📖 dvc exp run is an experiment-specific alternative to dvc repro.

What are the use cases? When would you use one over another? Are there any examples? Does the description meaningfully reduce a confused user's frustration?

Related to https://stackoverflow.blog/2022/04/25/empathy-for-the-dev-avoiding-common-pitfalls-when-communicating-with-developers/

TL;DR:

very few users want to be using software. Instead, they want to do the things that software enables. [...] Users don’t want to buy your software, and they don’t want to read your documentation—they just want to have their problems solved

and http://mkremins.github.io/blog/doors-headaches-intellectual-need/

TL;DR:

A hammer (numerous dvc subcommands) seems pointless if you’ve never seen a nail (what are the different problems?)

shcheklein commented 2 years ago

I think I'm missing the point of the question, or I also have some bias.

exp is captured repro. exp enables a higher-level use case of "experiments" on top of some low level building blocks like pipelines (including repro), etc. Do we need a separate command like dvc repro? I don't know. I don't like it personally, "aesthetically" (that it's disconnected from dvc stage, that it overlaps with exp, etc.). I also don't like dvc run, which will hopefully be replaced at last with dvc stage add. But it feels like some low level "make"-file like interface has its place.

Can I come up with a use case where dvc exp run won't solve the problem? Don't know tbh, feels like no, so again it will be only some aesthetics, or some edge cases. Maybe some automation, when it's clear that you don't want to deal with the overhead (no matter how small it is) of dvc exp run. Maybe we can rename it to dvc stage run --all to make it cleaner.

Nothing in use-cases/experiment-tracking nor user-guide/experiment-management seems to tell existing dvc repro users why they should bother with/what are the use cases of dvc exp.

the whole point was not to complicate this and not bother users of dvc exp with low level details like dvc repro - why should they care? why do you think it's important for people who come to experiments to know about some strange alternative?

It doesn't seem clear to users what's the difference between stage/repro (i.e. pipelines) and exp (i.e. experiments).

as I mentioned, what you call pipelines is just one of the building blocks for experiments

Should there be a page clearly describing the difference between stages and experiments?

I can only see it from the perspective of a single command (repro vs exp run), what else? stage add does not compete at all with experiments.

jorgeorpinel commented 2 years ago

In case I wasn't clear earlier: I also wish this topic was clearer, but there's ambiguity in the product itself, and the docs are reflecting that. Deprecating repro or even exp is constantly chattered about, for example. @casperdcl do you have a suggestion on how to clarify this?

exp is captured repro [...] low level "make"-file like interface

I like this. exp builds on top of repro and the latter becomes more of a "helper" (kind of how we expose fetch even when it's part of pull). Good notes for the cmd ref as @shcheklein points out.

why do you think it's important for people who come to experiments to know about some strange alternative?

Yes, we consciously decided not to do this. In fact we have a pending task to remove all or most "pipeline" info from https://dvc.org/doc/user-guide/experiment-management/running-experiments (see https://github.com/iterative/dvc.org/issues/2768).

casperdcl commented 2 years ago

CLI discussion at https://github.com/iterative/dvc/issues/7866 is a prerequisite to docs.

drozzy commented 2 years ago

These two clarification points I've found in various places (the latter one from @SoyGema) have been very useful for me as a user:

  1. Experiments commands exp produce a git ref, that is how it stores its state.
  2. "If you use dvc repro, each time you execute it will overwrite everything without going back unless you commit in between each execution." "dvc exp run allows to run different experiments, for example hyper parameter changes without having to create a commit for each one"
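
To make those two points concrete, here is a toy model in Python. It is only a sketch of the idea "exp is captured repro" and makes no claim about how DVC is implemented: `ToyProject`, its `repro` and `exp_run` methods, the `epochs` parameter, and the fake `score` are all hypothetical names made up for illustration. The `experiments` dict stands in for the git refs mentioned in point 1.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProject:
    """Toy stand-in for a project; NOT DVC's real data model."""
    params: dict
    outputs: dict = field(default_factory=dict)      # current workspace results
    experiments: dict = field(default_factory=dict)  # name -> captured snapshot

    def repro(self):
        # Recompute outputs in place; the previous results are overwritten.
        self.outputs = {"score": self.params["epochs"] * 10}

    def exp_run(self, name, **overrides):
        # Same run as repro, but the params and results are captured
        # under a name, so no manual commit is needed to keep them.
        self.params.update(overrides)
        self.repro()
        self.experiments[name] = {
            "params": dict(self.params),
            "outputs": dict(self.outputs),
        }


project = ToyProject(params={"epochs": 1})

# repro-style workflow: the second run replaces the first.
project.repro()
project.params["epochs"] = 2
project.repro()  # the epochs=1 result is gone unless it was committed

# exp-style workflow: every run is kept and can be compared later.
project.exp_run("exp-a", epochs=3)
project.exp_run("exp-b", epochs=4)
```

After this runs, the workspace holds only the latest result, while `project.experiments` still has both `exp-a` and `exp-b` with their own params and outputs. That is roughly the difference described above: repro leaves you with one state, exp run leaves you with a history you did not have to commit by hand.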

dberenbaum commented 1 year ago

Some additional feedback.

From @mvshmakov:

We’ve recently discovered that dvc repro is not really suitable for CI if the user wants live experiments in Studio to be enabled. As dvc repro does not create a new experiment, we don’t log params to the Studio, thus the experiment will be displayed only partially.

From https://discord.com/channels/485586884165107732/1065577177007018015/1065630078668648458:

I guess I was confused because when I checked the difference in docu, dvc exp run has the comment "Provides a way to execute and track experiments in your project without polluting it with unnecessary commits, branches, directories, etc." so I thought dvc exp is only "experimental" mode for stuff I don't want to have tracked (which I wanted). A remark about legacy in dvc run docs could be preventing further newbies like me asking stupid questions 🙂