iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

Release 3.0 #7093

Closed efiop closed 1 year ago

efiop commented 2 years ago
## Data Management
- [ ] https://github.com/iterative/dvc/issues/4658
- [x] #9531 (@efiop)
- [x] Rename `dvc add --jobs` to `--remote-jobs` (frequently gets confused with core.checksum_jobs)
- [ ] https://github.com/iterative/dvc/pull/9591
## Experiments and Pipelines
- [ ] #9221
- [x] add `dvc stage add --run`
- [x] [Remove Stage-level vars](https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#:~:text=Stage%2Dspecific%20values%20are%20also%20supported) (@skshetry)
- [ ] ~`dvc exp push` changes? (see https://iterativeai.slack.com/archives/C01116VLT7D/p1686570450222809?thread_ts=1686321601.633079&cid=C01116VLT7D)~ decided not to make breaking changes now; will revisit dropping `--rev HEAD`
## Deprecations
- [x] https://github.com/iterative/dvc/discussions/5940. (@skshetry )
- [x] Remove [~`exp gc`~](https://github.com/iterative/dvc/issues/7991), [~`exp init`~](https://github.com/iterative/dvc/issues/9267), [~`exp show --html`~](https://github.com/iterative/dvc/issues/9268), ~`exp show --pcp`~.
- [x] Deprecate `dvc add --recursive` flag.
- [x] Remove `dvc run`.
- [x] Remove `--show-json`/`--show-csv`/`--show-md` in favor of existing `--json`/`--csv`/`--md` flags.
- [x] [Drop top-level plots definition as a dictionary, can be defined as a list instead](https://github.com/iterative/dvc/pull/8412).
- [x] Drop 1.0 `dvc.lock` format support.
- [x] Soft deprecate `desc`/`type`/`labels`/`meta` in `dvc.yaml` and `.dvc` files, now that we have artifacts (drop CLI flags, `dvc data ls`, and drop them from docs).
- [x] https://github.com/iterative/dvc/issues/9457 (and replace with `--message` short flag)
- [x] Deprecate `dvc add --file` (obscure and more likely to confuse than help)

Other release blockers

Studio Readiness for 3.0 release

dmpetrov commented 2 years ago

That's a great summary of the coming changes. I'd appreciate it if some of the items could be clarified.

  • [ ] introduce a data (name is subject to change) section in dvc.yaml for data (also need to look into imports)

Should we introduce similar for metrics, plots?

  • default hash change.

It would be helpful if you could clarify this.

  • [ ] deprecate --show-* flags
  • [ ] remove --show-vega flag

Where does this requirement come from? How will users generate PNG files from our templates in CI/CD?

efiop commented 2 years ago

Should we introduce similar for metrics, plots?

Sure, seems like plots will be the first. Just trying to mark this new approach with introducing separate sections.

It would be helpful if you could clarify this.

CC @pared

pared commented 2 years ago

@dmpetrov

deprecate --show-* flags

I guess this is about migrating --show-json and friends to --json and similar, just to make it shorter.
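To make that migration concrete, here is a minimal sketch of a helper that rewrites the old flags in an argument list, for example when updating CI scripts. The function name `migrate_flags` and the sample command are hypothetical; the flag pairs are only the ones named in this issue, and `--show-vega` is deliberately left untouched because it has no one-to-one replacement.

```python
# Hypothetical helper for updating scripts; not part of DVC.
# The mapping covers only the flag pairs named in this issue.
OLD_TO_NEW = {
    "--show-json": "--json",
    "--show-csv": "--csv",
    "--show-md": "--md",
}


def migrate_flags(argv):
    """Return a copy of argv with deprecated --show-* flags replaced."""
    return [OLD_TO_NEW.get(arg, arg) for arg in argv]


if __name__ == "__main__":
    # Sample argument list, used only for illustration.
    print(migrate_flags(["dvc", "metrics", "show", "--show-json"]))
```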

remove --show-vega flag

We recently introduced --json for plots. It's not official yet, but since we support images, we will need to support it officially sooner or later for automation scenarios too. In that case --show-vega does not make sense, because it's limited to only one plot type. Results obtained with --show-vega can be achieved by specifying a single target for --json, so there is not much of an argument for keeping it.

dberenbaum commented 2 years ago

@pared I thought --json was a hidden option for VS Code?

For the vega example in https://cml.dev/doc/cml-with-dvc using vl2png, how can this be handled with --json?

pared commented 2 years ago

@pared I thought --json was a hidden option for VS Code?

Well, it seems to me that we will have to start supporting it. From the user's perspective, it does not make much sense to support Vega as JSON and not do the same for images; our current problem of "where to store the results" should probably not affect the end user.

For the vega example in https://cml.dev/doc/cml-with-dvc using vl2png, how can this be handled with --json?

Some json manipulation is needed for that. In this particular example:

dvc plots diff \
            --target classes.csv \
            --template confusion \
            -x actual \
            -y predicted \
            --json master | jq '."classes.csv" | .[0]' >> vega.json
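For anyone without jq, the same selection can be sketched in Python. This assumes only what the jq filter above implies: the JSON output is an object keyed by target name, and each value is a list whose first element is the Vega spec. The function name `first_spec` and the sample payload are made up for illustration; the real schema is not shown in this thread.

```python
import json


def first_spec(plots_json: str, target: str) -> dict:
    """Mimic the jq filter '."<target>" | .[0]' on the command's JSON output."""
    data = json.loads(plots_json)
    return data[target][0]


if __name__ == "__main__":
    # Made-up stand-in for the real output; only the outer shape matters here.
    sample = json.dumps({"classes.csv": [{"mark": "rect"}, {"mark": "line"}]})
    print(json.dumps(first_spec(sample, "classes.csv")))
```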
daavoo commented 2 years ago

For the vega example in https://cml.dev/doc/cml-with-dvc using vl2png, how can this be handled with --json?

Some json manipulation is needed for that. In this particular example:

dvc plots diff \
            --target classes.csv \
            --template confusion \
            -x actual \
            -y predicted \
            --json master | jq '."classes.csv" | .[0]' >> vega.json

I think send-comment with vega plot is a widely used example. Opened issue https://github.com/iterative/cml.dev/issues/172

pared commented 2 years ago

I think send-comment with vega plot is a widely used example. Opened issue

Good idea, though let's not solve it until we agree on whether --json is what we want and that we remove --show-vega.

dberenbaum commented 2 years ago

Moved the plots discussion to https://github.com/iterative/dvc/discussions/7183.

karajan1001 commented 2 years ago

One question here: there might be some subrepos deep inside directories. Even if we deprecate .dvc files, we might still need to walk through the whole structure to find all of the stages in subrepos. Otherwise, we will need a clearer and better way to handle subrepos.

skshetry commented 2 years ago

@karajan1001, we do not traverse into subrepos except for dvc list. dvc.api/dvc get/dvc import use particular logic to find subrepos, but do not require it as we avoid getting files from subrepos. There have been discussions on getting rid of the "subrepos traversal" completely as it's not worth it with the complexity it introduces.

dberenbaum commented 2 years ago

Additional items:

dtrifiro commented 1 year ago

Additional items:

dberenbaum commented 1 year ago
skshetry commented 1 year ago

@dberenbaum, can you please push these items/tasklist to the top comment? The item does not have to be a sure thing to be on top. :)

skshetry commented 1 year ago

remove read-only support for dvc-1.0 lockfile and 1.0-lockfile-to-2.0 migrator. 1.0 lockfile was not versioned, 2.0 got schema keyword for version info. (Implementation, introduced in https://github.com/iterative/dvc/pull/5128)

Hopefully, we can remove this in 3.0.
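As a rough illustration of the distinction, the sketch below guesses the format of an already-parsed lockfile. It assumes only what is stated above: a 2.0 file carries a top-level `schema` key and a 1.0 file has no version marker. The function name `lockfile_format` and the sample mappings are hypothetical, and this is not the actual DVC implementation.

```python
def lockfile_format(data: dict) -> str:
    """Guess the dvc.lock format from an already-parsed mapping.

    Assumption: a versioned file has a top-level `schema` key, while an
    unversioned 1.0 file does not.
    """
    schema = data.get("schema")
    return str(schema) if schema is not None else "1.0"


if __name__ == "__main__":
    # Illustrative mappings only; real lockfiles contain more fields.
    print(lockfile_format({"schema": "2.0", "stages": {}}))
    print(lockfile_format({"train": {"cmd": "python train.py"}}))
```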

skshetry commented 1 year ago

If there's no strong opinion, I'd like to deprecate the --show-* flags soon (let's say in 2.35, for example). By deprecate, I mean that they will be hidden from CLI help. We haven't documented them in the docs for a long time now. We can remove them in 3.0.

jorgeorpinel commented 1 year ago

start obsoleting data .dvc files

May I ask what this looks like? Meaning what replaces them; I remember discussing a new top-level data section in dvc.yaml (can't find the issue now), is that still being considered?

In general though, I'm not sure I'm seeing huge changes that warrant a major semver update. IMO going to 3.0 is an opportunity for hyping and devrel activities (think Python 2 vs 3, how much anticipation there was, etc.) so it should be tied with something really big. I feel we're definitely in 2.0 ever since dvc.yaml pipelines were introduced, and maybe we could've gone 3.0 during the DVC Experiments release(s)... But since we didn't, should we wait for something on that scale?

skshetry commented 1 year ago

since we didn't, should we wait for something on that scale?

These days, I believe DVC should be on calver. With that, we'll have a yearly cadence on release and we can hype things up. We can break compatibility on year++ release if we like.

Also, we are at v2.30, and I personally find it difficult to work with such numbers. With calver, at least I would know when a feature was released. And for us, there is not much difference between minor and patch versions.

daavoo commented 1 year ago

These days, I believe DVC should be on calver. With that, we'll have a yearly cadence on release and we can hype things up. We can break compatibility on year++ release if we like.

+1 for calver

dberenbaum commented 1 year ago

May I ask what this looks like? Meaning what replaces them; I remember discussing a new top-level data section in dvc.yaml (can't find the issue now), is that still being considered?

I think that was the idea. I don't think it's likely to get rid of .dvc files anytime soon.

Not all of these ideas are likely to actually happen.

In general though, I'm not sure I'm seeing huge changes that warrant a major semver update. IMO going to 3.0 is an opportunity for hyping and devrel activities (think Python 2 vs 3, how much anticipation there was, etc.) so it should be tied with something really big. I feel we're definitely in 2.0 ever since dvc.yaml pipelines were introduced, and maybe we could've gone 3.0 during the DVC Experiments release(s)... But since we didn't, should we wait for something on that scale?

I'm using this ticket to track changes that we need to remember to make as part of the 3.0 release, rather than the drivers for that release or an indicator that it's coming soon (this issue has been around for almost a year). I think we do have big upcoming changes like cloud versioning and no-pipeline experiments that are enough to drive another major release, especially on top of all the changes to plots, queues, etc. that have been made since 2.0. I hope we can do a major release in early 2023.

dberenbaum commented 1 year ago

These days, I believe DVC should be on calver. With that, we'll have a yearly cadence on release and we can hype things up. We can break compatibility on year++ release if we like.

If we use calver and we want to make a breaking change before the next year, what do we do?

skshetry commented 1 year ago

If we use calver and we want to make a breaking change before the next year, what do we do?

Yeah, it is a bit limiting in that way. In practice, I don't think we'll break it that often.

Also, that's only applicable when we want to mix semver with calver. If we don't want to, any new version can be a breaking change (and that's left up to the user to determine).
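A minimal sketch of the "break only on year++" rule, assuming a hypothetical `YYYY.MINOR.PATCH` version layout that nobody in this thread has actually proposed. The function name `crosses_year` is made up for illustration.

```python
def crosses_year(old: str, new: str) -> bool:
    """Return True when `new` belongs to a later calendar year than `old`.

    Assumes a hypothetical YYYY.MINOR.PATCH layout.
    """
    return int(new.split(".")[0]) > int(old.split(".")[0])


if __name__ == "__main__":
    print(crosses_year("2023.4.0", "2024.1.0"))
    print(crosses_year("2023.1.0", "2023.9.2"))
```

Under the mixed semver/calver reading, only upgrades for which this returns True would be allowed to break compatibility, which is exactly the limitation raised in the question above.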

skshetry commented 1 year ago

We also haven't dropped .dvc file support for the pipeline stages, and we have kept feature parity between dvc.yaml and .dvc files. They support params, metrics, plots (to some extent), all recent annotations features, desc, etc. It would be nice to remove support for them even if we keep .dvc files for data management.

shcheklein commented 1 year ago

Plots: consider dropping rev field in Vega templates and rely on dvc_data_version_info (cc @pared)

@dberenbaum @daavoo @mattseddon do you have any insights on this? We use rev, filename, etc in group by and other parts of the Vega templates. If we nest them into dvc_data_version_info we might need to then apply flatten transformation in Vega templates to deal with values + quite a lot of changes to Studio, VS Code, etc. I wonder what was wrong with including filename, rev, field as regular fields into datapoints?

daavoo commented 1 year ago

@dberenbaum @daavoo @mattseddon do you have any insights on this?

We could take a step back and meet to define what format would be better for VSCode given today's features and needs. The --json flag is hidden, not documented, and most likely used only by VSCode. I don't think we need to wait for 3.0 to change the schema, we have already changed it in 2.X releases.

dberenbaum commented 1 year ago

@daavoo I think it was not about --json but about getting rid of rev field everywhere in favor of dvc_data_version_info. We may have to dig through the discussions and PRs to determine why it was done this way unless @mattseddon remembers.

Also, unless it's causing some headaches now, I'm not sure it needs to be a priority to make any plots schema changes for 3.0.

dberenbaum commented 1 year ago

We use rev, filename, etc in group by and other parts of the Vega templates. If we nest them into dvc_data_version_info we might need to then apply flatten transformation in Vega templates to deal with values + quite a lot of changes to Studio, VS Code, etc. I wonder what was wrong with including filename, rev, field as regular fields into datapoints?

@shcheklein Coming back to your question here, I think it was because we needed the existing rev to be the concatenated value of everything to make it work with old plots templates that don't know about filename or field. Therefore, we couldn't use rev for solely the revision info.

shcheklein commented 1 year ago

@dberenbaum yep, I see that we need more info. What I'm trying to understand is the difference in two approaches:

  1. Introduce this nested dvc_data_version_info structure and keep expanding it (with filename, rev, etc, etc)
  2. Keep these all fields on the data point level.

With both approaches I believe we can achieve the same result, unless I'm missing something (?). But introducing an additional layer of nesting would potentially complicate all the templates (e.g. we would need to use a flatten transform on the fields to be able to group by them, etc).

The downside of the second option is name collisions: we are effectively making rev, field, filename, etc. reserved names. This can and should be mitigated by prefixing them with something (__rev or __dvc_rev).
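To compare the two shapes, here is a small sketch that converts a data point from option 1 (nested) into option 2 (flat, prefixed). The function name `flatten_point`, the `__dvc_` prefix choice, and the sample data point are assumptions for illustration; only `dvc_data_version_info`, `rev`, and `filename` come from this discussion.

```python
def flatten_point(point: dict,
                  nested_key: str = "dvc_data_version_info",
                  prefix: str = "__dvc_") -> dict:
    """Lift nested version info to the top level under prefixed names."""
    # Keep every regular field, including a user column that happens to be called "rev".
    flat = {k: v for k, v in point.items() if k != nested_key}
    # Prefix the nested fields so they cannot collide with user columns.
    for key, value in point.get(nested_key, {}).items():
        flat[prefix + key] = value
    return flat


if __name__ == "__main__":
    # Illustrative data point; the values are made up.
    nested = {
        "step": 1,
        "loss": 0.5,
        "rev": "user-column",
        "dvc_data_version_info": {"rev": "main", "filename": "metrics.csv"},
    }
    print(flatten_point(nested))
```

With the flat shape a template can group by `__dvc_rev` directly, while the user's own `rev` column survives unchanged.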

skshetry commented 1 year ago

A breaking change related to the packaging that I want to do is move dvc.utils.pkg to just dvc.pkg or dvc.build. That prevents us from importing dvc.utils when we want to read just pkg. Although this does not require 3.0.

(Side note: It'd be nice to figure out a way to set the pkg type without changing source code, as changing it makes the build non-reproducible, i.e. what I create locally is not going to be the same as what we upload for the same commit.)

dberenbaum commented 1 year ago

Added an item to push/pull run-cache by default (don't see why we wouldn't do this).

dberenbaum commented 1 year ago

Added two more items that might require some discussion:

daavoo commented 1 year ago

Added for discussion:

dberenbaum commented 1 year ago
  • Drop External outputs

I think this needs a discussion like #9221 for what should replace it. I still see it being used frequently. For example, https://discord.com/channels/485586884165107732/1089975153854779412 and https://discord.com/channels/485586884165107732/1089910570888724520.

dberenbaum commented 1 year ago

Added my thoughts on priorities in the checklist:

skshetry commented 1 year ago

I took the liberty of cleaning up the checklist with tasks that we agree on. Other undecided tasks are still there on a collapsed section. If anyone feels like they should be part of 3.0, feel free to move them up.

dberenbaum commented 1 year ago

Thanks @skshetry! I dropped items from TBD that it seemed we would obviously not get to, and then I moved a few down to TBD:

Please give your feedback on those and the other items in the TBD list so we can either prioritize them or drop them. Thanks!

skshetry commented 1 year ago

It’s literally deleting just this file.

https://github.com/iterative/dvc/blob/37858be02fdef056801d917aa167ea300a3afa67/dvc/main.py#L1-L2

It's already an alias to dvc.cli.main, and we are not using it internally. In the past, we have suggested using main() as an API. It doesn't strictly require 3.0, but I didn't want to break existing users.

EDIT: I removed it from the checklist. As it's an internal API change, I don't think we have to write it down here.

pmrowla commented 1 year ago

I would prefer to drop the dos2unix behavior now

dberenbaum commented 1 year ago

@pmrowla Moved it up.

daavoo commented 1 year ago

https://github.com/iterative/dvc/issues/9272 - unclear to me if we still want to do this (@daavoo)

I think we should drop it in favor of proper built-in support for markdown output (https://github.com/iterative/dvc-render/issues/123)

dberenbaum commented 1 year ago

@dmpetrov @shcheklein If you have any inputs on what should/shouldn't be included (for example, dropping checkpoints), please share.

skshetry commented 1 year ago

@omesser, it'd be better to move the Studio issues to its own repository, since they are private and unrelated to dvc's release.

omesser commented 1 year ago

@skshetry they are already on studio, just referenced here to keep track. Those issues are in fact related to the dvc 3.0 release on the product side (the 3.0 release is more holistic to the ecosystem and not just a github release on this repo; we will use this opportunity to create some PR material). I can create a placeholder umbrella issue there and dereference those, no objection, but all it will achieve is to hide the issue titles from open source users reading this issue :man_shrugging: No problem, I will do that if you think it will be cleaner for this public-facing issue. In any case, the release material and messaging will refer to the larger DVC ecosystem's new capabilities, including Studio features.

EDIT: created https://github.com/iterative/studio/issues/5944 and using this one as reference here

skshetry commented 1 year ago

The issue is more about changes that are going to happen for 3.0; let's keep it at that. The Studio issues are follow-ups, similar to many other issues that have to happen around or after the 3.0 release, like docs, release announcement, etc.

skshetry commented 1 year ago

Working on 1.0 lock format deprecation.

omesser commented 1 year ago

The Studio issues are follow-ups, similar to many other issues that have to happen around or after 3.0 release like docs, release announcement, etc.

No, they are part of the 3.0 public release (product release, not technical github release for the DVC repo). We can track them both and not have separate dvc-pre-requisite-release-checklist and dvc-product-release-checklist 🤔

dberenbaum commented 1 year ago

Some possible ideas to add looking through dvc add options:

skshetry commented 1 year ago

Two more proposals from my side:

dberenbaum commented 1 year ago
* [Remove Stage-level vars](https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#:~:text=Stage%2Dspecific%20values%20are%20also%20supported). I haven't seen this being used and makes implementation complex.

😅 https://discord.com/channels/485586884165107732/1111557378735865856/1111636564183883876

It's not clear to me whether the stage-level vars are needed there. @skshetry Maybe you have an idea for how else to handle that use case?

* Make `dvc exp push` work by default, or remove the need for explicit `remote` argument.
  (If we make `dvc exp push` work by default, what should it push?)

I've come around to not minding the explicit remote requirement since it has occasionally saved me from accidentally pushing to origin when it was an upstream remote (like when I clone a demo project but don't want to mess with the studio demo). Especially now that we have VS Code to make pushing easier, I don't mind keeping the CLI explicit.

dberenbaum commented 1 year ago

When dropping external outputs, can we continue to support --outs-no-cache with external paths, and can we allow them without the --external flag (which we can drop)?