Closed efiop closed 1 year ago
That's a great summary of the coming changes. I'd appreciate it if some of the items could be clarified.
- [ ] introduce `data` (name is subject to change) section in dvc.yaml for data (also need to look into imports)
Should we introduce similar for `metrics`, `plots`?
- default hash change.
It would be helpful if you could clarify this.
- [ ] deprecate `--show-*` flags
- [ ] remove `--show-vega` flag
Where does this requirement come from? How will users generate png files from our templates in CI/CD?
> Should we introduce similar for metrics, plots?
Sure, seems like plots will be the first. Just trying to mark this new approach with introducing separate sections.
> It would be helpful if you could clarify this.
CC @pared
@dmpetrov
> deprecate --show-* flags
I guess this is about migrating `--show-json` and friends to `--json` and similar, just to make it shorter.
> remove --show-vega flag
We recently introduced `--json` for plots. It's not official yet, but since we support images, we will need to do it sooner or later (officially) in automation scenarios too. In that case `--show-vega` does not make sense, because it's limited to only one plot type. Results obtained by `--show-vega` can be achieved by specifying a single target for `--json`, so there is not much of an argument for keeping it.
@pared I thought `--json` was a hidden option for VS Code?
For the vega example in https://cml.dev/doc/cml-with-dvc using vl2png, how can this be handled with `--json`?
> @pared I thought --json was a hidden option for VS Code?
Well, it seems to me that we will have to start supporting it. It does not make sense, from the user's perspective, to support vega as json and not do the same for images - our current problem of "where to store the results" should probably not affect the end user.
> For the vega example in https://cml.dev/doc/cml-with-dvc using vl2png, how can this be handled with --json?
Some json manipulation is needed for that. In this particular example:
    dvc plots diff \
      --target classes.csv \
      --template confusion \
      -x actual \
      -y predicted \
      --json master | jq '."classes.csv" | .[0]' >> vega.json
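For CI scripts that would rather avoid `jq`, the same extraction could be sketched in Python. This assumes only what the `jq` filter above implies: the `--json` output is an object keyed by target path whose value is a list of plot specs. The sample payload and the helper name `first_plot_spec` are made up for illustration, and the real output schema of the hidden `--json` flag may differ.

```python
import json


def first_plot_spec(plots_json: str, target: str) -> dict:
    """Mirror the jq filter '."<target>" | .[0]' on a JSON string."""
    payload = json.loads(plots_json)
    return payload[target][0]


# Hypothetical stand-in for the output of `dvc plots diff ... --json master`.
sample = json.dumps(
    {"classes.csv": [{"mark": "rect", "title": "confusion"}, {"mark": "line"}]}
)

spec = first_plot_spec(sample, "classes.csv")
vega_json = json.dumps(spec, indent=2)  # the text that would go into vega.json
print(vega_json)
```

In a real pipeline the `sample` string would be replaced by the captured standard output of the `dvc plots diff` command.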
> For the vega example in https://cml.dev/doc/cml-with-dvc using vl2png, how can this be handled with --json?
> Some json manipulation is needed for that. In this particular example:
>     dvc plots diff \
>       --target classes.csv \
>       --template confusion \
>       -x actual \
>       -y predicted \
>       --json master | jq '."classes.csv" | .[0]' >> vega.json
I think `send-comment` with vega plot is a widely used example. Opened issue https://github.com/iterative/cml.dev/issues/172
> I think send-comment with vega plot is a widely used example. Opened issue
Good idea, though let's not solve it until we agree on whether `--json` is what we want and that we remove `--show-vega`.
Moved the plots discussion to https://github.com/iterative/dvc/discussions/7183.
One question here: there might be some `subrepo`s deep inside directories. Even if we had deprecated `.dvc` files, we might still need to walk through the whole structure to find all of the stages in `subrepo`s. Or we will need a clearer and better way to handle `subrepo`s.
@karajan1001, we do not traverse into subrepos except for `dvc list`. `dvc.api`/`dvc get`/`dvc import` use particular logic to find subrepos, but do not require it, as we avoid getting files from subrepos. There have been discussions on getting rid of the "subrepos traversal" completely, as it's not worth the complexity it introduces.
Additional items:
- `exp gc` (see #7991)
- `dvc exp run --run-all/-j` flags and consider how to replace with top-level exp commands to start the queue (see https://github.com/iterative/dvc/discussions/8123)
- `run` and replacing with `dvc stage add --run` (see https://github.com/iterative/dvc/issues/5846)
- `.dvc/plots` as default location (no longer needed) (cc @pared)
- Plots: consider dropping `rev` field in Vega templates and rely on `dvc_data_version_info` (cc @pared)
> Additional items:
@dberenbaum, can you please push these items/tasklist to the top comment? The item does not have to be a sure thing to be on top. :)
Remove read-only support for the dvc-1.0 lockfile and the 1.0-lockfile-to-2.0 migrator. The 1.0 lockfile was not versioned; 2.0 got a `schema` keyword for version info. (Implementation introduced in https://github.com/iterative/dvc/pull/5128)
Hopefully, we can remove this in 3.0.
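The versioning difference described above could be sketched as follows. This is not DVC's actual loader: the helpers `lockfile_version` and `load_stages` are hypothetical, and the dictionaries only assume that a 2.0 lockfile carries a top-level `schema` key (with stages under `stages`) while a 1.0 lockfile has no version marker at all.

```python
def lockfile_version(lock_data: dict) -> str:
    """Return the schema version of a parsed lockfile (hypothetical helper).

    Assumption: a 2.0 lockfile has a top-level "schema" key holding the
    version, while an unversioned 1.0 lockfile maps stage names directly.
    """
    return str(lock_data.get("schema", "1.0"))


def load_stages(lock_data: dict) -> dict:
    """Reject the legacy format, as if read-only 1.0 support were removed."""
    if lockfile_version(lock_data) == "1.0":
        raise ValueError("unversioned 1.0 lockfile is no longer supported")
    return lock_data.get("stages", {})


# Illustrative parsed lockfiles; contents are made up.
legacy = {"train": {"cmd": "python train.py"}}
current = {"schema": "2.0", "stages": {"train": {"cmd": "python train.py"}}}
```

With the migrator gone, a project still on the legacy format would hit the error path and need to regenerate its lockfile.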
If there's no strong opinion, I'd like to deprecate `--show-*` flags soon (let's say in 2.35, for example). By deprecate, I mean that they will be hidden from CLI help. We haven't documented them in the docs for a long time now. We can remove them in 3.0.
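Hiding a flag from help while keeping it functional can be sketched with the standard library's `argparse`. This is an illustration rather than DVC's actual parser code: the `metrics-show` program name and the shared `json` destination are assumptions.

```python
import argparse

parser = argparse.ArgumentParser(prog="metrics-show")  # hypothetical command
parser.add_argument(
    "--json",
    action="store_true",
    default=False,
    help="Show output in JSON format.",
)
# Deprecated spelling: still accepted, but hidden from `--help` output.
parser.add_argument(
    "--show-json",
    action="store_true",
    default=False,
    dest="json",
    help=argparse.SUPPRESS,
)

help_text = parser.format_help()
old_style = parser.parse_args(["--show-json"])
new_style = parser.parse_args(["--json"])
```

Both spellings set the same attribute, so downstream code only needs to check `args.json`, and removing the old flag later is a deletion of one `add_argument` call.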
> start obsoleting data .dvc files
May I ask what this looks like? Meaning what replaces them; I remember discussing a new top-level `data` section in dvc.yaml (can't find the issue now), is that still being considered?
In general though, I'm not sure I'm seeing huge changes that warrant a major semver update. IMO going to 3.0 is an opportunity for hyping and devrel activities (think Python 2 vs 3, how much anticipation there was, etc.) so it should be tied with something really big. I feel we're definitely in 2.0 ever since dvc.yaml pipelines were introduced, and maybe we could've gone 3.0 during the DVC Experiments release(s)... But since we didn't, should we wait for something on that scale?
> since we didn't, should we wait for something on that scale?
These days, I believe DVC should be on calver. With that, we'll have a yearly cadence on release and we can hype things up. We can break compatibility on `year++` release if we like.
Also we are at v2.30, I personally find it difficult to work with such numbers, at least with calver, I know when the feature was released. And for us, there is not much difference between minor/patch versions.
> These days, I believe DVC should be on calver. With that, we'll have a yearly cadence on release and we can hype things up. We can break compatibility on year++ release if we like.
+1 for calver
> May I ask what this looks like? Meaning what replaces them; I remember discussing a new top-level data section in dvc.yaml (can't find the issue now), is that still being considered?
I think that was the idea. I don't think it's likely to get rid of .dvc files anytime soon.
Not all of these ideas are likely to actually happen.
> In general though, I'm not sure I'm seeing huge changes that warrant a major semver update. IMO going to 3.0 is an opportunity for hyping and devrel activities (think Python 2 vs 3, how much anticipation there was, etc.) so it should be tied with something really big. I feel we're definitely in 2.0 ever since dvc.yaml pipelines were introduced, and maybe we could've gone 3.0 during the DVC Experiments release(s)... But since we didn't, should we wait for something on that scale?
I'm using this ticket to track changes that we need to remember to make as part of the 3.0 release, rather than the drivers for that release or an indicator that it's coming soon (this issue has been around for almost a year). I think we do have big upcoming changes like cloud versioning and no-pipeline experiments that are enough to drive another major release, especially on top of all the changes to plots, queues, etc. that have been made since 2.0. I hope we can do a major release in early 2023.
> These days, I believe DVC should be on calver. With that, we'll have a yearly cadence on release and we can hype things up. We can break compatibility on year++ release if we like.
If we use calver and we want to make a breaking change before the next year, what do we do?
> If we use calver and we want to make a breaking change before the next year, what do we do?
Yeah, it is a bit limiting in that way. In practice, I don't think we'll break it that often.
Also, that's only applicable when we want to mix semver with calver. If we don't want to, any new version can be a breaking change (and that's left up to the user to determine).
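The "break only on `year++`" policy could be expressed as a simple check. This is a sketch under an assumed `YYYY.MINOR.MICRO` version scheme; the thread does not settle on a format, and both function names are hypothetical.

```python
def release_year(version: str) -> int:
    """Extract the year from a YYYY.MINOR.MICRO version string (assumed scheme)."""
    return int(version.split(".")[0])


def may_break_compatibility(installed: str, candidate: str) -> bool:
    """Under the yearly policy, only a bump of the year may break compatibility."""
    return release_year(candidate) > release_year(installed)
```

Under pure calver without this policy, the check carries no guarantee, which is the trade-off raised above: any new version could be breaking, and it is left to the user to determine.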
We also haven't dropped .dvc file support for the pipeline stages, and we have kept feature parity between dvc.yaml and .dvc files.
They support params, metrics, plots (to some extent), all recent annotations features, desc, etc. It would be nice to remove support for them even if we keep `.dvc` files for data management.
> Plots: consider dropping rev field in Vega templates and rely on `dvc_data_version_info` (cc @pared)
@dberenbaum @daavoo @mattseddon do you have any insights on this? We use `rev`, `filename`, etc. in group by and other parts of the Vega templates. If we nest them into `dvc_data_version_info`, we might then need to apply a `flatten` transformation in Vega templates to deal with values, plus quite a lot of changes to Studio, VS Code, etc. I wonder what was wrong with including `filename`, `rev`, `field` as regular fields in datapoints?
> @dberenbaum @daavoo @mattseddon do you have any insights on this?
We could take a step back and meet to define what format would be better for VSCode given today's features and needs.
The `--json` flag is hidden, not documented, and most likely used only by VSCode. I don't think we need to wait for `3.0` to change the schema; we have already changed it in `2.X` releases.
@daavoo I think it was not about `--json` but about getting rid of the `rev` field everywhere in favor of `dvc_data_version_info`. We may have to dig through the discussions and PRs to determine why it was done this way unless @mattseddon remembers.
Also, unless it's causing some headaches now, I'm not sure it needs to be a priority to make any plots schema changes for 3.0.
> We use `rev`, `filename`, etc in group by and other parts of the Vega templates. If we nest them into `dvc_data_version_info` we might need to then apply `flatten` transformation in Vega templates to deal with values + quite a lot of changes to Studio, VS Code, etc. I wonder what was wrong with including `filename`, `rev`, `field` as regular fields into datapoints?
@shcheklein Coming back to your question here, I think it was because we needed the existing `rev` to be the concatenated value of everything to make it work with old plots templates that don't know about `filename` or `field`. Therefore, we couldn't use `rev` solely for the revision info.
@dberenbaum yep, I see that we need more info. What I'm trying to understand is the difference between two approaches:
- `dvc_data_version_info` structure and keep expanding it (with `filename`, `rev`, etc.)
With both approaches I believe we can achieve the same result, unless I'm missing something (?). But introducing the additional layer of nesting could complicate all the templates (e.g. we would need to use `transform` to flatten the fields to be able to group by, etc.).
Downside of the second option - name collisions - we are effectively making `rev`, `field`, `filename`, etc. reserved names. This still can and should be mitigated by prefixing them with something (`__rev` or `__dvc_rev`).
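The trade-off between the two datapoint shapes can be illustrated with plain Python. This is a sketch only: the datapoints, the user-owned `rev` column, and the helpers `flatten` and `group_by` are invented for illustration and do not reflect DVC's actual plots schema or Vega's transform implementation.

```python
from collections import defaultdict

VERSION_KEY = "dvc_data_version_info"

# Nested shape: version info lives under one key. The top-level "rev" column
# here is user data (say, engine revolutions), not a Git revision.
nested_points = [
    {"step": 0, "rev": 1200, VERSION_KEY: {"rev": "main", "filename": "run.csv"}},
    {"step": 1, "rev": 1350, VERSION_KEY: {"rev": "main", "filename": "run.csv"}},
    {"step": 0, "rev": 1100, VERSION_KEY: {"rev": "workspace", "filename": "run.csv"}},
]


def flatten(point: dict, prefix: str = "__dvc_") -> dict:
    """Lift nested version info into prefixed top-level fields."""
    flat = {k: v for k, v in point.items() if k != VERSION_KEY}
    for name, value in point.get(VERSION_KEY, {}).items():
        flat[prefix + name] = value
    return flat


def group_by(points: list, field: str) -> dict:
    """Group datapoints by the value of a top-level field."""
    groups = defaultdict(list)
    for point in points:
        groups[point[field]].append(point)
    return dict(groups)


# The nested shape needs the extra flatten step before grouping; the flat,
# prefixed shape can be grouped directly and leaves the user's "rev" intact.
flat_points = [flatten(p) for p in nested_points]
by_revision = group_by(flat_points, "__dvc_rev")
```

The prefix is what keeps the user's own `rev` column from colliding with the revision field, which is the mitigation suggested above.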
A breaking change related to packaging that I want to make is to move `dvc.utils.pkg` to just `dvc.pkg` or `dvc.build`. That way we avoid importing `dvc.utils` when we only want to read `pkg`. This does not require 3.0, though.
(Side note: It'd be nice to figure out a way to set the `pkg` type without changing source code, as changing it makes the build non-reproducible, i.e. what I create locally is not going to be the same as what we upload for the same commit.)
Added an item to push/pull run-cache by default (don't see why we wouldn't do this).
Added two more items that might require some discussion:
- `dvc exp init` (it's outdated relative to the current recommended way to onboard to experiments)
- `dvc exp show --html` (we should probably do this outside of the CLI if we want a parallel coordinates plot)
Added for discussion:
- Drop External outputs
I think this needs a discussion like #9221 for what should replace it. I still see it being used frequently. For example, https://discord.com/channels/485586884165107732/1089975153854779412 and https://discord.com/channels/485586884165107732/1089910570888724520.
Added my thoughts on priorities in the checklist:
I took the liberty of cleaning up the checklist with tasks that we agree on. Other undecided tasks are still there in a collapsed section. If anyone feels like they should be part of 3.0, feel free to move them up.
Thanks @skshetry! I dropped items from TBD that seemed like we would obviously not get to, and then I moved a few down to TBD:
Please give your feedback on those and the other items in the TBD list so we can either prioritize them or drop them. Thanks!
It’s literally deleting just this file.
https://github.com/iterative/dvc/blob/37858be02fdef056801d917aa167ea300a3afa67/dvc/main.py#L1-L2
It’s already an alias to dvc.cli.main, and we are not using it internally. In the past, we have suggested to use main() as an API. It doesn’t strictly require 3.0, but I didn’t want to break.
EDIT: I removed it from the checklist, as it's internal API change, I don't think we have to write it down here.
I would prefer to drop the dos2unix behavior now
@pmrowla Moved it up.
https://github.com/iterative/dvc/issues/9272 - unclear to me if we still want to do this (@daavoo)
I think we should drop it in favor of proper built-in support for markdown output (https://github.com/iterative/dvc-render/issues/123)
@dmpetrov @shcheklein If you have any inputs on what should/shouldn't be included (for example, dropping checkpoints), please share.
@omesser, it'd be better to move the Studio issues to its own repository, since they are private and unrelated to dvc's release.
@skshetry they are already on studio, just referenced here to keep track. Those issues are in fact related to the dvc 3.0 release on the product side (the 3.0 release is more holistic to the ecosystem and not just a github release on this repo. We will use this opportunity to create some PR material). I can create a placeholder umbrella issue there and dereference those, no objection, but all it will achieve is hiding the issue titles from open source users reading this issue :man_shrugging: No problem, I will do that if you think it will be cleaner for this public-facing issue. In any case, the release material and messaging will refer to the larger DVC ecosystem's new capabilities, including Studio features.
EDIT: created https://github.com/iterative/studio/issues/5944 and using this one as reference here
The issue is more about changes that are going to happen for 3.0, let's keep it at that. The Studio issues are follow-ups, similar to many other issues that have to happen around or after the 3.0 release, like docs, release announcement, etc.
Working on 1.0 lock format deprecation.
> The Studio issues are follow-ups, similar to many other issues that have to happen around or after 3.0 release like docs, release announcement, etc.
No, they are part of the 3.0 public release (product release, not technical github release for the DVC repo). We can track them both and not have separate dvc-pre-requisite-release-checklist and dvc-product-release-checklist 🤔
Some possible ideas to add looking through `dvc add` options:
* `dvc add --file` (obscure and more likely to confuse than help)
* `dvc add --jobs` (frequently gets confused with `core.checksum_jobs`)
Two more proposals from my side:
* Make `dvc exp push` work by default, or remove the need for explicit `remote` argument. (If we make `dvc exp push` work by default, what should it push?)
* [Remove Stage-level vars](https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#:~:text=Stage%2Dspecific%20values%20are%20also%20supported). I haven't seen this being used, and it makes the implementation complex.
😅 https://discord.com/channels/485586884165107732/1111557378735865856/1111636564183883876
It's not clear to me whether the stage-level vars are needed there. @skshetry Maybe you have an idea for how else to handle that use case?
> * Make `dvc exp push` work by default, or remove the need for explicit `remote` argument. (If we make `dvc exp push` work by default, what should it push?)
I've come around to not minding the explicit `remote` requirement since it has occasionally saved me from accidentally pushing to `origin` when it was an upstream remote (like when I clone a demo project but don't want to mess with the studio demo). Especially now that we have VS Code to make pushing easier, I don't mind keeping the CLI explicit.
When dropping external outputs, can we continue to support `--outs-no-cache` with external paths, and can we allow them without the `--external` flag (which we can drop)?
Other release blockers
Studio Readiness for 3.0 release