Closed casperdcl closed 7 months ago
I would argue that `dvc run` is not really useful, and I would just keep `dvc stage add`.
Using the CLI to define stages gets very cumbersome for all but the simplest stages: I'm assuming most people just use `dvc run` (or `stage add`) to generate the boilerplate/dvc.yaml for the stage and then manually edit it according to their needs.
For example, comparing the help for `dvc run` and `dvc stage add`, the only flags that differ are the following (in `run`):
--no-exec Only create dvc.yaml without actually running it.
--no-commit Don't put files/directories into cache.
--no-run-cache Execute the command even if this stage has already
been run with the same
command/dependencies/outputs/etc before.
My ideal workflow would be `dvc stage add <name> <args>` (manually tweak if required), then run `dvc repro <stage>` once satisfied.
To summarize:
* `dvc stage {add,list}` :heavy_check_mark:
* `dvc repro` :heavy_check_mark:
* `dvc run` ❌

I think this is mostly a duplicate of https://github.com/iterative/dvc/issues/5846
In the docs, `dvc run` has already been removed almost everywhere in https://github.com/iterative/dvc.org/pull/3223.
I think this is mostly a duplicate of #5846
Mostly, although the suggested syntax is slightly different and means getting rid of `repro`.
We could also keep it open or move it to a discussion to keep track of ideas for renaming subcommands generally. We are already planning to split `dvc status` into `dvc data status` and `dvc stage status`. Many other commands already make sense as subcommands of `dvc stage/data` or might benefit from being split.
I don't think we are ready to promise this yet, though, and it's unclear whether we would actually deprecate the existing commands, which could be painful for many existing users.
Using the CLI to define stages gets very cumbersome for all but the simplest stages: I'm assuming most people just use `dvc run` (or `stage add`) to generate the boilerplate/dvc.yaml for the stage and then manually edit it according to their needs.
The ideal workflow is probably to generate stages in VS Code with auto-completion (and maybe other helpers). I think it's already on the roadmap, but I'm not sure when it will happen. For now, `dvc exp init` is probably the easiest way to generate a boilerplate stage.
If we migrate all of the existing commands to an `experiment`-based workflow as well as `dvc exp`-related commands, I think we can deprecate the subcommand `exp` then.
If we migrate all of the existing commands to an `experiment`-based workflow as well as `dvc exp`-related commands, I think we can deprecate the syntax `exp` then.
migrate all of the existing command to an experiment based workflow
Is there an issue or story about this?
Currently not.
@dtrifiro from this I understand you prefer `repro` over `run` and would like to (soft-)deprecate the latter.
In that case, isn't `dvc stage {add,list,repro}` still less confusing than the current `dvc {stage {add,list},repro}`?
It would be consistent with the `dvc exp` interface.
`dvc run` only executes one single stage, so it's more like `dvc stage add ...` + `dvc repro --single-item <name>`.
So it does not make sense to put that into `dvc stage repro`. `stage` is a lower-level utility compared to the very high-level `exp` command.
Why can't we have them both: `dvc stage repro` to keep consistent with other stage commands, and `dvc repro` as a shortcut?
prefer repro over run and would like to (soft)deprecate the latter
`run` is already soft-deprecated.
Why can't we have them both
Why do we need them both? `repro` is a top-level operation (not stage-specific), as @skshetry pointed out.
Otherwise, I think we should discuss the proposed `stage run/repro` in the existing #5846, as @pmrowla pointed out.
It's not clear to me that #5846 is a solution. I'm asking about use cases rather than implementation/adding-even-more-confusingly-named-options.
A) What are the underlying concepts? afaik it's: 1. Pipelines, 2. Experiments
B) Does the command-line interface map 1:1 to the above concepts?
C) Do the docs map 1:1?
Am I missing something? Need to sort out (A) before there's any hope of addressing (B) & (C).
The use case for `dvc run` is when a user wants to generate or modify a stage in `dvc.yaml`. This is redundant and confusing, which is why removing/deprecating it in favor of `dvc stage add` is preferred. The use case for `dvc run` is not reproducing a single stage within an existing pipeline/`dvc.yaml`.
The one thing that `dvc run` does right now that `dvc stage add` does not is that `dvc run` can actually execute the stage command once. This can be useful when generating stages because it makes DVC verify that all of the outputs you listed for the stage were actually generated, and that the command itself was executed properly (so `run` provides a sanity check/validation).
To fill this use case, `dvc stage add` just needs an extra flag to run the stage once, i.e. `dvc stage add --run`, as described in #5846.
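To make the "sanity check" idea concrete, here is a minimal Python sketch of what running a stage once and validating it could look like. It is not DVC's implementation: the helper name `run_and_check`, the file names, and the toy command are all hypothetical, and it only checks the exit code and that each declared output exists.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_check(cmd, outs, cwd):
    """Run `cmd` (an argument list) in `cwd` and return the declared outputs that are missing.

    Hypothetical helper, not part of DVC. Raises subprocess.CalledProcessError
    if the command exits with a non-zero status.
    """
    subprocess.run(cmd, cwd=cwd, check=True)
    return [out for out in outs if not (Path(cwd) / out).exists()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Toy stage command: writes model.txt but never produces metrics.json.
        cmd = [sys.executable, "-c", "open('model.txt', 'w').write('ok')"]
        missing = run_and_check(cmd, ["model.txt", "metrics.json"], workdir)
        print("missing outputs:", missing)
```

A stage definition that lists an output the command never writes is caught immediately, which is the validation value described above.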
A) What are the underlying concepts? afaik it's: 1. Pipelines, 2. Experiments B) does the commandline interface map 1:1 to the above concepts?
`dvc repro` exists to do everything you have described in `1. Pipelines`
`dvc exp run` exists to do `2. Experiments`.
This seems pretty clear-cut to me.
`dvc stage ...` exists solely to provide a CLI for adding/modifying/removing stages inside `dvc.yaml` files in the event that a user prefers using the CLI to do it instead of editing the YAML file themselves. It supplements both the "pipelines" and "experiments" use cases, since a user needs to generate `dvc.yaml` files in both cases.
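As a rough illustration of "stage commands are just configuration editing", the sketch below models a pipeline file as a plain mapping with a top-level `stages` key, where each stage has a `cmd` and optional `deps` and `outs`. This is a simplified assumption about the file's shape (real `dvc.yaml` files support more fields and are YAML, not JSON), and `add_stage`, the stage name, and the file names are hypothetical.

```python
import json


def add_stage(pipeline, name, cmd, deps=(), outs=()):
    """Return a copy of `pipeline` with one stage added or replaced.

    Hypothetical helper for illustration; nothing is executed here,
    only the configuration mapping changes.
    """
    stages = dict(pipeline.get("stages", {}))
    stage = {"cmd": cmd}
    if deps:
        stage["deps"] = list(deps)
    if outs:
        stage["outs"] = list(outs)
    stages[name] = stage
    return {**pipeline, "stages": stages}


pipeline = add_stage(
    {},
    "train",
    "python train.py",
    deps=["train.py", "data.csv"],
    outs=["model.pkl"],
)
print(json.dumps(pipeline, indent=2))
```

Running the pipeline is a separate concern from editing this mapping, which mirrors the split between `stage ...` and `repro`/`exp run` argued for here.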
IMO this is the same as `dvc remote ...` existing to add/remove/modify remote entries in a DVC configuration file, while `dvc push/pull/fetch` exist separately from `dvc remote ...`. The `remote` commands are for configuration. `push/pull/fetch` fill the use case of "store and retrieve files to/from cloud storage".
Likewise, `stage ...` provides configuration (of `dvc.yaml` files). `repro` and `exp run` are for actually reproducing pipelines and conducting experiments.
So for clarification we want to support:
A)
If that is correct, then CLI suggestion:
B)
# TL;DR:
## 1. pipelines
dvc stage {add,list,rm,verify,repro} [stage_names... (default: all)]
## 2. experiments
dvc exp {add,list,rm,repro,stage} [exp_names... (default: all)]
* `dvc stage {add,list,rm}`: feature request `dvc stage rm`
* `dvc stage add --run`: feature request #5846
* `dvc repro --verify` or `dvc stage verify [--all]` or `dvc stage repro --verify`: #5369
* `dvc exp {list,rm,stage}`
* `dvc exp apply` -> `dvc exp stage`
* `dvc exp repro`: renamed from `dvc exp run` for consistency
* `dvc exp add`: feature request replacing `dvc exp run --queue` -> `dvc exp add`
feature request dvc stage rm
I do agree extracting the stage-related functions from `dvc remove` could simplify the UX. I believe this ticket can be reduced to this request, as the others are covered:
dvc stage add --run: feature request
https://github.com/iterative/dvc/issues/5846
feature request dvc exp rm
`exp remove` exists.
feature request replacing dvc exp run --queue -> dvc exp add
Or maybe `dvc queue add`? Please mention in https://github.com/iterative/dvc/issues/7592 / EP
After the discussions in https://github.com/iterative/dvc.org/pull/4460 about whether pipelines should be part of data management, experiments, or neither, I think we need to revisit this not only from a docs perspective, but also a product perspective (thanks to @casperdcl for repeatedly trying to move this forward here and in https://github.com/iterative/dvc.org/issues/3630).
DVC experiments were initially an extension of DVC pipelines, which likely led to a lot of this confusion. We have done a lot of work to separate experiments from pipelines and can now better reposition them.
In this context, `exp run` is the only `exp` command that relates to pipelines. We can move it outside of `exp` and have it as the singular command to run a pipeline (either replacing `repro` or making a new command like `dvc stage run/repro`), with an option to run it like the existing `repro` (don't save an experiment, for those who want that).
Why it matters:
* `repro` vs `exp run` confuses people
* `exp run` (and its prominence in the docs/first exp command) makes experiments seem like an extension of pipelines
* `exp run` name makes people think it's not useful for non-experiment pipeline runs

I'll suggest naming it simply `dvc run` (in 3.0).
The first step could be adding a flag (`--no-exp`?) to make the current `dvc exp run` behave like `repro` so that we have one consolidated command we could use when we are ready to deprecate the others.
I'll suggest naming it simply `dvc run` (in 3.0).
I like it and agree with having a top-level command, although the obvious downside is that it could be confusing to replace an existing command with a new one that has different functionality. WDYT about using some synonym like `dvc exec`?
Reminder that we also need to update the landing page if we decide to add a top-level command because it currently shows `dvc exp run`.
I'll suggest naming it simply dvc run (in 3.0).
My 2cs on this - I would still prefer `dvc exp ..` since it makes it easier to connect with experiments. It looks strange that we would have the `dvc exp ...` family + one command that belongs to it but for some reason is outside.
🤔 I prefer a top-level command to `exp run` for a few reasons:
`repro`). I'd also say it biases users who start from the CLI interface to thinking experiments are about running pipelines and hides the dvclive-only functionality.
`dvc exp run` even though they aren't running ML experiments, or who prefer `repro` solely because it sounds more like their use case.
@dacbd You should also be aware that we are considering this change to rename.
I'd say it's mostly about pipelines
Yep, but what is the primary user scenario, the high-level case that we'll be explaining? E.g. can we start by writing a summary for this command, ideally in a way that people can understand? (Btw, we might realize that it's better to still keep two commands.)
Yep, but what is the primary user scenario, the high-level case that we'll be explaining? E.g. can we start by writing a summary for this command, ideally in a way that people can understand? (Btw, we might realize that it's better to still keep two commands.)
The high-level use case is running a data workflow. Maybe I want to include ML training and compare metrics at the end, but maybe not. Maybe I want to version my data so I can go back to any previous iteration, but maybe not. This is how I used it in the past, and even though the end result was an ML model, I only really cared about executing my pipeline steps in a make-like way.
Maybe I misunderstand you because I don't see how it is only about experiments from a high-level user scenario. I think at that point you could argue all of dvc belongs under exp because the entire product targets ml training scenarios. How is data management in DVC any more of a high-level case?
The high-level use case is running a data workflow. Maybe I want to include ML training and compare metrics at the end, but maybe not. Maybe I want to version my data so I can go back to any previous iteration, but maybe not. This is how I used it in the past, and even though the end result was an ML model, I only really cared about executing my pipeline steps in a make-like way.
Is it the way you write an intro for this new command? Can we actually try to draft the summary and description? I feel it can be complicated.
I understand where you are coming from, I think. My concern is that it can become too abstract for people. It's easier for me to think like this: we have a high-level scenario, e.g. experiment tracking (versioning). If people come to DVC because of it, we should make it simple to understand. In this case `dvc exp run` is the simplest way, less confusing. The description is as simple as `it runs / creates a new experiment(s)`. If we start generalizing into "running a data workflow", it becomes too complicated to my mind.
I see that trying to nicely generalize and squeeze everything can be quite a hard task.
In case of DVC we could have done something like:
dvc exp run/show
dvc pipeline run/status
dvc data status ...
and it should be clear that the `exp` set of commands is built on top of the other commands. If we want to actually emphasize a single scenario, maybe we should remove `exp` from all commands above.
Maybe still keep data commands top-level for historic reasons and since it's similar to Git and I don't see people being confused a lot with them.
wdyt?
Stepping back, here were my goals for prioritizing this:
Why it matters:
* `repro` vs `exp run` confuses people
* `exp run` (and its prominence in the docs/first exp command) makes experiments seem like an extension of pipelines
* `exp run` name makes people think it's not useful for non-experiment pipeline runs
I think the first point is the biggest pain point for users, and without it, I wouldn't prioritize this for 3.0 release. However, replacing the existing commands with a 3rd new command feels a bit like https://xkcd.com/927.
WDYT about adding a flag to `exp run` to make it work like `repro` (`--no-exp`/`--no-save`/`--no-ref`) and then making `repro` an alias of `exp run`? That way it's at least simple to explain that they do the same thing, although I still think we will have to consider whether to emphasize one or the other in docs.
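For the sake of discussion, the alias idea can be sketched with a toy `argparse` parser. This is not DVC's CLI code: it assumes the flag ends up being called `--no-save` (one of the names floated above), and `build_parser`/`normalize` are hypothetical names. The point is only that both spellings resolve to the same request.

```python
import argparse


def build_parser():
    """Toy parser illustrating the proposal; not DVC's real CLI definition."""
    parser = argparse.ArgumentParser(prog="dvc")
    commands = parser.add_subparsers(dest="command", required=True)

    # dvc exp run [--no-save] [targets ...]
    exp = commands.add_parser("exp")
    exp_commands = exp.add_subparsers(dest="exp_command", required=True)
    exp_run = exp_commands.add_parser("run")
    exp_run.add_argument("--no-save", action="store_true")
    exp_run.add_argument("targets", nargs="*")

    # dvc repro [targets ...]  (treated as an alias below)
    repro = commands.add_parser("repro")
    repro.add_argument("targets", nargs="*")
    return parser


def normalize(argv):
    """Map either spelling onto one request: which targets, and whether to save an experiment."""
    args = build_parser().parse_args(argv)
    if args.command == "repro":
        # Alias: identical to `exp run --no-save`.
        return {"targets": args.targets, "save_experiment": False}
    return {"targets": args.targets, "save_experiment": not args.no_save}


same = normalize(["repro", "train"]) == normalize(["exp", "run", "--no-save", "train"])
print(same)
```

With this shape, the docs only need to explain one behaviour plus one flag, which is the "simple to explain" benefit mentioned above.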
@shcheklein Responding to your comments below.
Is it the way you write an intro for this new command? Can we actually try to draft the summary and description? I feel it can be complicated.
I would say it like this: `Run pipeline stages and by default save the results as an experiment`.
e.g. experiment tracking (versioning). If people come to DVC because of it we should make it simple to understand
If we want to actually emphasize a single scenario
Do we want to emphasize a single scenario? Some people come to dvc to run a multi-step data process first and experiment tracking is secondary or never needed. For people who come for experiment tracking first, they can get far with the other exp commands before needing to run a pipeline.
Run pipeline stages and by default save the results as an experiment
Yep, that already sounds a bit too complicated (you have to understand something about pipelines?). It contradicts a bit with For people who come for experiment tracking first, they can get far with the other exp commands before needing to run a pipeline.
?
Some people come to dvc to run a multi-step data process first and experiment tracking is secondary or never needed.
Yep. And that's why we introduced trails, etc. Still not sure that was the best decision though. It complicates everything a lot. You are right that people come for different things. But I feel it would be a mistake to try to make a single command that is so general in its description that encompasses all the scenarios at once.
about adding a flag to exp run to make it work like repro (--no-exp/--no-save/--no-ref) and then making repro an alias of exp run
I'm fine with that. I would try to keep the description simple: "Runs an experiment". (Pipelines are an implementation detail, e.g.
"In order for DVC to know exactly which command to run, you need to specify an entry point (stage) in dvc.yaml. It can be as simple as:
---> single stage cmd
pipeline goes here
But also DVC supports multiple stages, etc ... with such and such benefits ...")
All of this ^^ should be part of the Experiments "trail"/set of commands/use case. If people care only about pipelines, why don't we introduce `dvc stage run`? Or even `dvc pipeline` subcommands? To make it explicit.
(Again, not a blocker from my end, just sharing my thoughts. The priority for me would be to keep the happy path around each scenario as simple as possible. I would look carefully at what the docs would look like: are they easy to read, etc.)
about adding a flag to exp run to make it work like repro (--no-exp/--no-save/--no-ref) and then making repro an alias of exp run
I'm fine with that.
Okay, let's start with this.
We can leave the rest for the product/docs discussion, but it doesn't need to block this.
Yep, that already sounds a bit too complicated (you have to understand something about pipelines?). It contradicts a bit with
For people who come for experiment tracking first, they can get far with the other exp commands before needing to run a pipeline.
?
Not sure I follow how it contradicts. From what I can tell, the difference is that you think the target user for the command should be someone coming for experiment tracking and I think the target user should be someone coming for pipelines. Am I misunderstanding?
From what I can tell, the difference is that you think the target user for the command should be someone coming for experiment tracking and I think the target user should be someone coming for pipelines. Am I misunderstanding?
Maybe. Yes, I would be optimizing this command for a single audience (and yes, probably experiments). If your idea was optimized for pipelines, then yes, it sounds good.
1. Catering to "experiment trackers" -> leaving `dvc exp run` and adding flags aliasing some functionalities to current repro (`--no-exp`/`--no-save`/`--no-ref`). Seems like there's agreement there (keeping complete functionality under the `exp` subcommand 👍).
2. Catering to "pipeliners": `dvc pipeline run` - this makes sense to me as the primary entry point. I think it's better than `dvc stage run`, and renaming `dvc stage add` -> `dvc pipeline add-stage` or similar to consolidate under the same subcommand also makes sense. `stage` is not the focus of attention imo; the `pipeline` is the object we operate on. "Adding stage" is descriptive of the operation, so to speak.
3. Top level commands: `dvc repro` -> `dvc run` - is that the suggested "move"? Maybe `dvc run` is the cleaner of the two (but no strong opinion). I see both as only nice-to-have (type less) and should alias to pipeline "core functionality" - `dvc pipeline run`.

@dberenbaum - wdyt about 2? In scope for 3.0?
Thanks @omesser! I don't think 2 is critical, especially since it doesn't sound like a breaking change. I was thinking of limiting 3.0 scope to 1 and making `dvc repro` an alias of `dvc exp run` since those are both breaking changes and help avoid the confusion of what's different between the commands.
If we started from scratch, I agree `dvc pipeline run` sounds good, but as discussed above, I'm not sure we have a good enough reason to add another alias. I would like to try to focus on docs targeted towards "pipeliners" first and let that help drive the next steps here. WDYT?
What's our goal here? Do we want `dvc exp run --no-save` to have the same behaviour as `dvc repro`? Or is it just a `--no-save` variant of `dvc exp run`? Ideally, they should be the same, but there are going to be subtle behaviour differences.
@dberenbaum - to give context, I'm bringing this up because I'm working on the get-started docs at the moment, and `dvc pipeline run` would be better than `dvc repro` for the new data-pipelines track IMO.
That being said it's not critical, it would just make it more polished
A possible reason why some features might be underused is naming inconsistency.
* `dvc stage {add,list}`
* `dvc repro`
* `dvc run`

surely should be unified as `dvc stage {add,list,run}` or `dvc stage {add,list,repro}`? Could sanitising these CLI subcommands be part of the next major release?