iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

Release 3.0 #7093

Closed by efiop 1 year ago

efiop commented 2 years ago
## Data Management
- [ ] https://github.com/iterative/dvc/issues/4658
- [x] #9531 (@efiop)
- [x] Rename `dvc add --jobs` to `--remote-jobs` (frequently gets confused with `core.checksum_jobs`)
- [ ] https://github.com/iterative/dvc/pull/9591
## Experiments and Pipelines
- [ ] #9221
- [x] add `dvc stage add --run`
- [x] [Remove Stage-level vars](https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#:~:text=Stage%2Dspecific%20values%20are%20also%20supported) (@skshetry)
- [ ] ~`dvc exp push` changes? (see https://iterativeai.slack.com/archives/C01116VLT7D/p1686570450222809?thread_ts=1686321601.633079&cid=C01116VLT7D)~ decided not to make breaking changes now; will revisit dropping `--rev HEAD`
## Deprecations
- [x] https://github.com/iterative/dvc/discussions/5940 (@skshetry)
- [x] Remove [~`exp gc`~](https://github.com/iterative/dvc/issues/7991), [~`exp init`~](https://github.com/iterative/dvc/issues/9267), [~`exp show --html`~](https://github.com/iterative/dvc/issues/9268), ~`exp show --pcp`~.
- [x] Deprecate `dvc add --recursive` flag.
- [x] Remove `dvc run`.
- [x] Remove `--show-json`/`--show-csv`/`--show-md` in favor of existing `--json`/`--csv`/`--md` flags.
- [x] [Drop top-level plots definition as a dictionary, can be defined as a list instead](https://github.com/iterative/dvc/pull/8412).
- [x] Drop 1.0 `dvc.lock` format support.
- [x] Soft deprecate `desc`/`type`/`labels`/`meta` in `dvc.yaml` and `.dvc` files, now that we have artifacts (drop CLI flags, `dvc data ls`, and drop them from docs).
- [x] https://github.com/iterative/dvc/issues/9457 (and replace with `--message` short flag)
- [x] Deprecate `dvc add --file` (obscure and more likely to confuse than help)
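For reference, the top-level plots change above means a `dvc.yaml` moves from the dictionary form to the list form (file names here are illustrative):

```yaml
# 2.x dictionary form (dropped in 3.0):
# plots:
#   confusion.json:
#     x: actual
#     y: predicted

# 3.0 list form; entries are paths, optionally with a config mapping:
plots:
  - confusion.json:
      x: actual
      y: predicted
  - loss.json
```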

Other release blockers

Studio Readiness for 3.0 release

dberenbaum commented 1 year ago
* [Remove Stage-level vars](https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#:~:text=Stage%2Dspecific%20values%20are%20also%20supported). I haven't seen this being used, and it makes the implementation complex.

😅 https://discord.com/channels/485586884165107732/1111557378735865856/1111636564183883876

It's not clear to me whether the stage-level vars are needed there. @skshetry Maybe you have an idea for how else to handle that use case?

@skshetry It turns out they aren't needed in that case, so I'm on board to drop them.

aschuh-hf commented 1 year ago

Hi, I posted this question regarding external outputs on Discord and am moving the discussion here as suggested.

I recently moved a larger-scale data analysis from our local server to AWS, running on a Ray cluster (https://www.ray.io/). The different steps are implemented as individual Ray jobs, which are composed into a reproducible pipeline with DVC using a dvc.yaml. Because the jobs/stages execute in AWS, and because the data is too large to store locally, the inputs and outputs of each stage are external AWS S3 URIs. Along the way I have run into a few issues (a stage output is reported "not in cache" even right after running dvc commit -f; an error when running a stage for the first time because the S3 URI prefix does not exist yet) that point to bugs in DVC's handling of external outputs (I still need to file GitHub issues for these).

However, looking at this roadmap for the next major release, version 3.0, I can see there is a to-do item that reads "drop external outputs". So my more immediately pressing questions are:

An example dvc.yaml would look something like:

vars:
- s3_uri_prefix: s3://bucket/prefix
stages:
  stage_1:
    cmd: ray job submit --working-dir . -- python stage_1.py --output ${s3_uri_prefix}/stage_1
    deps:
    - stage_1.py
    outs:
    - ${s3_uri_prefix}/stage_1:
        push: false
  stage_2:
    cmd: ray job submit --working-dir . -- python stage_2.py --input ${s3_uri_prefix}/stage_1 --output ${s3_uri_prefix}/stage_2
    deps:
    - stage_2.py
    - ${s3_uri_prefix}/stage_1
    outs:
    - ${s3_uri_prefix}/stage_2:
        push: false

I'd appreciate any pointers as to how DVC would still fit into the picture here.

dberenbaum commented 1 year ago

@aschuh-hf I'm personally torn on this decision, but ultimately (as you mentioned) there are a lot of bugs and problems with the way we handle external outputs today. The goal is not to give up on the use case you mention above (it's more important than ever), but to build a better solution for it.

Besides staying on DVC 2.x for now, you should be able to continue working with pipelines streaming data to/from S3 if you set `cache: false`. Given that your data is large, do you mostly use DVC to manage your pipeline, or do you use the cache and check out older versions of your data?

To cache your S3 pipeline outputs, I hope we can use cloud versioning to help address this, so please also comment there with your thoughts. This could mean that DVC no longer has to do slow operations like caching an extra copy of your data, so I hope it will end up improving the workflow in the end.
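As a concrete sketch of the `cache: false` suggestion (reusing the illustrative bucket/prefix from the example above), the S3 output can be declared so that DVC tracks it in the pipeline graph without copying it into its cache:

```yaml
stages:
  stage_1:
    cmd: ray job submit --working-dir . -- python stage_1.py --output s3://bucket/prefix/stage_1
    deps:
    - stage_1.py
    outs:
    - s3://bucket/prefix/stage_1:
        cache: false  # track in the DAG, but don't copy into the DVC cache
        push: false
```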

aschuh-hf commented 1 year ago

Thanks, @dberenbaum, for adding some more motivation for potentially deprecating it.

As a matter of fact, my current use case is a data preprocessing and model evaluation pipeline. In this context I mostly use the dvc.yaml as a means to document the commands I run alongside the source code, making it easier to reproduce the same results later or re-run the same steps with different inputs. For this purpose I don't care much about caching past results beyond the current run, and if I were to revert to an earlier version I would be OK with spending the compute again to reproduce the outputs. Cloud versioning also seems like it would fit in well, though we haven't enabled it for the S3 bucket we currently use for such experimental runs.

I can also confirm that the time spent caching and re-computing hashes is quite a bottleneck, given how slow it is for thousands of S3 objects. With the Ray cluster, the time spent waiting for DVC to perform these operations is far greater than the actual execution time of the individual stage commands.

It would mostly be sufficient for my use case here if `outs` and `deps` merely defined the dependency graph.

It would, however, also be useful if stages would still only re-run if the inputs, parameters, or command of a stage changed. If that can be done via version information provided by cloud versioning backends, that sounds good.
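For reference, a sketch of what the cloud-versioning setup under discussion looks like in DVC's config (the remote name and bucket URL are placeholders; the S3 bucket itself must have object versioning enabled):

```ini
# .dvc/config -- 'storage' and the bucket URL are placeholders
['remote "storage"']
    url = s3://bucket/prefix
    version_aware = true
[core]
    remote = storage
```

With a version-aware remote, DVC relies on the provider's object versions rather than keeping content-addressed copies, which is what could avoid the extra caching step mentioned above.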

> [...] you should be able to continue working with pipelines streaming data to/from S3 if you set `cache: false`.

Thanks for reminding me of the cache option; I hadn't considered that I could use it to bypass the duplication of the S3 objects (given the S3 cache is configured to be in the same S3 bucket as the stage outputs).

> drop external outputs

I think when I read this, I was mostly concerned that S3 URIs would no longer be valid stage `deps` or `outs`. But if these are still supported, that works for me.

efiop commented 1 year ago

One last thing we forgot to include is to stop validating the cache on every minor operation. This is a leftover from the past, when symlink/hardlink were the default and there was a greater risk of corrupting your cache. These days we default to reflink/copy, so there is no real reason to validate the cache on every operation like checkout. We should instead only validate the cache where we really need to (before push?) and introduce `dvc cache check` (or something like that) for manual validation.

In practical terms, I don't think we document that certain operations check the cache, so we are not really tied to the 3.0 release and can just stop doing it whenever.

dberenbaum commented 1 year ago

> In practical terms, I don't think we document that certain operations check the cache, so we are not really tied to the 3.0 release and can just stop doing it whenever.

Why don't we create a separate issue for it then and just make it high priority so we get to it soon?

skshetry commented 1 year ago

Stage-level vars were created to support loading different parameters for different stages. Take the following as an example:

stages:
  train_model:
    foreach: [us, uk]
    do:
      wdir: ${item}
      vars: [params.yaml]  # uk/params.yaml, us/params.yaml, etc.
      cmd: cat params.yaml

But I don't find it very elegant; it is cumbersome, adds a lot of complexity, and I haven't seen anyone use it. Hydra support may have made it redundant.

sjawhar commented 1 year ago

> Stage-level vars were created to support loading different parameters for different stages. Take the following as an example:
>
>     stages:
>       train_model:
>         foreach: [us, uk]
>         do:
>           wdir: ${item}
>           vars: [params.yaml]  # uk/params.yaml, us/params.yaml, etc.
>           cmd: cat params.yaml
>
> But I don't find it very elegant; it is cumbersome, adds a lot of complexity, and I haven't seen anyone use it. Hydra support may have made it redundant.

We actually use this feature extensively. Our pipelines tend to have a lot of this:

stages:
  preprocess_feature_one:
    foreach: ${sessions}
    do:
      vars:
        - preprocessing_key: feature_one
          input_dir: load
      <<: &stage_preprocess
        cmd: >-
          python src/preprocess.py
          --params-file=params.yaml
          ${preprocessing_key}
          data/interim/${input_dir}/${key}
          data/interim/${preprocessing_key}/${key}
        params:
          - preprocess.${preprocessing_key}
        deps:
          - ../../poetry.lock
          - data/interim/${input_dir}/${key}
          - src/preprocess.py
        outs:
          - data/interim/${preprocessing_key}/${key}

  preprocess_feature_two:
    foreach: ${sessions}
    do:
      vars:
        - preprocessing_key: feature_two
          input_dir: feature_one
      <<: *stage_preprocess

While removing this functionality might not completely block us from upgrading to DVC 3.x (one could copy/paste the stage definition with small changes), it would definitely increase the maintenance burden of our pipelines and make mistakes much easier to make (oops, forgot to change X in all the stages). It's possible I'm overlooking some functionality made possible by Hydra?

I might have even gloated about our use of this pattern in my last DVC office hours :sweat_smile:

efiop commented 1 year ago

@dberenbaum Created https://github.com/iterative/dvc/issues/9561

dberenbaum commented 1 year ago

Thanks @efiop! Does #9561 relate to 3.0 items?

dberenbaum commented 1 year ago

@sjawhar That's a really creative way to do things! If we were to implement #5172, do you think that would solve the same use case? From what I can tell, you are limited by needing to loop over both sessions and features.

@skshetry WDYT about the comment from @sjawhar above? What do you think would be the best way to do it?
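For context on #5172: it proposes a matrix-style definition for looping over multiple dimensions at once. A speculative sketch of how that might cover the sessions x features double loop above (syntax and names are illustrative, taken from the proposal rather than a released feature):

```yaml
stages:
  preprocess:
    matrix:
      feature: [feature_one, feature_two]
      session: ${sessions}
    cmd: >-
      python src/preprocess.py
      --params-file=params.yaml
      ${item.feature}
      data/interim/${item.feature}/${item.session}
    deps:
      - src/preprocess.py
    outs:
      - data/interim/${item.feature}/${item.session}
```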

efiop commented 1 year ago

@dberenbaum I will try to create `dvc cache`/`remote check` if I have time this week, and if I do, it will be nice to drop cache checks in a few places right away. If not, I think we can do that later, even though that's not ideal.

skshetry commented 1 year ago

@sjawhar, that's an interesting way to do nested looping with merge keys.

How many sessions do you run? If they aren't that many, I'd suggest using a (duplicated) map/dictionary and looping over that.

# params.yaml
sessions:
  feature1_session1:
    key: feature_one
    input_dir: feature_one
  feature1_session2:
    key: feature_one
    input_dir: feature_one
  feature2_session1:
    key: feature_two
    input_dir: feature_two
  feature2_session2:
    key: feature_two
    input_dir: feature_two

This might be more readable, although you can still use YAML anchors/aliases to avoid duplication.

sessions:
  feature1_session1: &feature1
    key: feature_one
    input_dir: feature_one
  feature1_session2: *feature1
  ...
  feature2_session1: &feature2
    key: feature_two
    input_dir: feature_two
  feature2_session2: *feature2
  ...

#5172 might be a better solution for this.

sjawhar commented 1 year ago

> @sjawhar, that's an interesting way to do nested looping with merge keys.
>
> How many sessions do you run? If they aren't that many, I'd suggest using a (duplicated) map/dictionary and looping over that.

@skshetry When we use DVC in a data registry pattern, there can be hundreds or thousands of sessions. You're right, though, that we could have an additional step that runs before the dvc commands and generates this params.yaml, but supporting it natively in DVC seems more logical (to me at least).