Closed by efiop 1 year ago
* [Remove Stage-level vars](https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#:~:text=Stage%2Dspecific%20values%20are%20also%20supported). I haven't seen this being used, and it makes the implementation complex.
😅 https://discord.com/channels/485586884165107732/1111557378735865856/1111636564183883876
It's not clear to me whether the stage-level vars are needed there. @skshetry Maybe you have an idea for how else to handle that use case?
@skshetry It turns out they aren't needed in that case, so I'm on board to drop them.
Hi, I posted this question regarding external outputs on Discord and am moving the discussion here as suggested.
I recently moved a larger-scale data analysis from our local server to run in AWS using a Ray cluster (https://www.ray.io/). The different steps are implemented as individual Ray jobs which are composed into a reproducible pipeline with DVC using a `dvc.yaml`. Because the execution of jobs / stages happens in AWS, and because the data is too large to store locally, the inputs and outputs of each stage are external AWS S3 URIs. Along the way I ran into one or another issue (a stage output always "not in cache" even right after `dvc commit -f` was run; an error running a stage the first time because the S3 URI prefix does not exist yet), indicating some bugs in DVC with regard to external outputs (I still need to file GitHub issues for these).
However, looking at this roadmap for the next major release version 3.0, I can see that there is a To-Do item that reads "drop external outputs". So my more immediately pressing questions are:
An example `dvc.yaml` would look something like:
```yaml
vars:
  - s3_uri_prefix: s3://bucket/prefix

stages:
  stage_1:
    cmd: ray job submit --working-dir . -- python stage_1.py --output ${s3_uri_prefix}/stage_1
    deps:
      - stage_1.py
    outs:
      - ${s3_uri_prefix}/stage_1:
          push: false
  stage_2:
    cmd: ray job submit --working-dir . -- python stage_2.py --input ${s3_uri_prefix}/stage_1 --output ${s3_uri_prefix}/stage_2
    deps:
      - stage_2.py
      - ${s3_uri_prefix}/stage_1
    outs:
      - ${s3_uri_prefix}/stage_2:
          push: false
```
I'd appreciate any pointers as to how DVC would still fit into the picture here.
@aschuh-hf I'm personally torn on this decision, but ultimately (as you mentioned) there are a lot of bugs and problems with the way we handle external outputs today. The goal is not to give up on the use case you mention above (it's more important than ever), but to build a better solution for it.
Besides staying on DVC 2.x for now, you should be able to continue working with pipelines streaming data to/from S3 if you set `cache: false`. Given that your data is large, do you mostly use DVC to manage your pipeline, or do you use the cache and check out older versions of your data?
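For concreteness, a minimal sketch of such a stage with caching disabled on the external output (the bucket path and script name are placeholders carried over from the example above):

```yaml
stages:
  stage_1:
    cmd: ray job submit --working-dir . -- python stage_1.py --output s3://bucket/prefix/stage_1
    deps:
      - stage_1.py
    outs:
      - s3://bucket/prefix/stage_1:
          cache: false
```

With `cache: false`, DVC still tracks the output as a node in the dependency graph but never copies it into the cache.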
To cache your s3 pipeline outputs, I hope we can use cloud versioning to help address this, so please also comment there with your thoughts. This could mean that DVC no longer has to do slow operations like caching an extra copy of your data, so I hope it will end up improving the workflow in the end.
Thanks, @dberenbaum, for adding some more motivation for potentially deprecating it.
As a matter of fact, my current use case was a data preprocessing and model evaluation pipeline. I am indeed mostly using the `dvc.yaml` in this context as a means to document the commands I run alongside the source code used, to make it easier to reproduce the same results another time or to re-run the same steps with different inputs. For this purpose I don't care too much about caching past results beyond the current run, and if I were to revert to an earlier version I would be OK with spending the compute again to reproduce the outputs. Cloud versioning also seems like it would fit in well, though we haven't enabled it for the S3 bucket we are using for such experimental runs at the moment.
I can also confirm that the time spent on caching and re-computing hashes is quite a bottleneck given how slow it is for thousands of S3 objects. Given the use of a Ray cluster, the time spent waiting for DVC to perform these operations is far greater than the actual execution time of the individual stage commands.
It would mostly be sufficient for my use case here if `outs` and `deps` merely defined the dependency graph.
It would, however, also be useful if stages would still only re-run if the inputs, parameters, or command of a stage changed. If that can be done via version information provided by cloud versioning backends, that sounds good.
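Purely as an illustration (not how DVC actually implements this), the skip decision could reduce to comparing the object version IDs recorded after the last run against what the storage backend reports now; all names and version strings below are made up:

```python
# Hypothetical sketch: decide whether a stage must re-run by comparing
# cloud object version IDs saved after the previous run with the ones
# the backend currently reports (e.g. via one HEAD request per object).

def stage_is_stale(recorded: dict, current: dict) -> bool:
    """Re-run if any dependency is new, removed, or has a different version ID."""
    return recorded != current

# Versions saved in the lock file after the previous run (illustrative).
recorded = {"s3://bucket/prefix/stage_1/a.parquet": "3HL4kqtJ"}

# Versions the backend reports now (illustrative).
current = {"s3://bucket/prefix/stage_1/a.parquet": "null9ZxK"}

print(stage_is_stale(recorded, current))         # True: version changed, re-run
print(stage_is_stale(recorded, dict(recorded)))  # False: unchanged, skip
```

The appeal is that the backend computes and stores the version IDs for free, so DVC would never have to re-hash thousands of objects itself.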
> [...] you should be able to continue working with pipelines streaming data to/from S3 if you set `cache: false`.
Thanks for reminding me of the `cache` option; I hadn't thought about using it to bypass the duplication of the S3 objects (given the S3 cache is configured to be within the same S3 bucket as the stage outputs).
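For reference, on DVC 2.x an external cache on S3 is configured roughly like this in `.dvc/config` (the bucket path is a placeholder; keeping it in the same bucket as the outputs avoids cross-bucket copies):

```ini
['remote "s3cache"']
    url = s3://bucket/cache
[cache]
    s3 = s3cache
```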
> drop external outputs

I think when I read this, I was mostly concerned that S3 URIs would no longer be valid stage `deps` or `outs`. But if those are still supported, that works for me.
One last thing we forgot to include is to stop validating the cache on every minor operation. This is a leftover from the past, when symlink/hardlink were the default and you ran a greater risk of corrupting your cache. These days we default to reflink/copy, so there is no real reason to validate the cache on every operation like `checkout`. We should instead only validate the cache where we really need to (before push?) and introduce `dvc cache check` (or something like that) for manual validation.
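To make the idea concrete, here is a hypothetical sketch of what such a manual check could boil down to (the command does not exist yet, and the function name is made up): walk the local cache and verify that each file's content still hashes to the name it is stored under, since DVC names cache files after their MD5, split as two hex characters for the directory and the remaining thirty for the file name.

```python
# Hypothetical sketch of a standalone cache validation pass.
import hashlib
from pathlib import Path

def check_cache(cache_dir: str) -> list[str]:
    """Return cache-relative paths whose content no longer matches their name."""
    corrupted = []
    root = Path(cache_dir)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        expected = path.parent.name + path.name  # reassemble the full md5
        actual = hashlib.md5(path.read_bytes()).hexdigest()
        if actual != expected:
            corrupted.append(f"{path.parent.name}/{path.name}")
    return corrupted
```

A real implementation would also need to handle `.dir` objects and remote caches, but the core of the check is just a re-hash that runs once on demand instead of on every operation.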
In practical terms, I don't think we document that certain operations check cache, so we are not really tied to 3.0 release and can just stop doing it whenever.
> In practical terms, I don't think we document that certain operations check cache, so we are not really tied to 3.0 release and can just stop doing it whenever.
Why don't we create a separate issue for it then and just make it high priority so we get to it soon?
Stage-level `vars` was created to support loading different parameters for different stages. Take the following as an example:
```yaml
stages:
  train_model:
    foreach: [us, uk]
    do:
      wdir: ${item}
      vars: [params.yaml] # uk/params.yaml, us/params.yaml, etc.
      cmd: cat params.yaml
```
But I don't find it very elegant: it is cumbersome, adds a lot of complexity, and I haven't seen anyone use it. Hydra support may have made it redundant.
> Stage-level `vars` was created to support loading different parameters for different stages. [...] I don't find it very elegant: it is cumbersome, adds a lot of complexity, and I haven't seen anyone use it. Hydra support may have made it redundant.
We actually use this feature extensively. Our pipelines tend to have a lot of this:
```yaml
stages:
  preprocess_feature_one:
    foreach: ${sessions}
    do:
      vars:
        - preprocessing_key: feature_one
          input_dir: load
      <<: &stage_preprocess
        cmd: >-
          python src/preprocess.py
          --params-file=params.yaml
          ${preprocessing_key}
          data/interim/${input_dir}/${key}
          data/interim/${preprocessing_key}/${key}
        params:
          - preprocess.${preprocessing_key}
        deps:
          - ../../poetry.lock
          - data/interim/${input_dir}/${key}
          - src/preprocess.py
        outs:
          - data/interim/${preprocessing_key}/${key}
  preprocess_feature_two:
    foreach: ${sessions}
    do:
      vars:
        - preprocessing_key: feature_two
          input_dir: feature_one
      <<: *stage_preprocess
```
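The `<<:` merge key with an anchor is standard YAML 1.1, so a plain YAML loader already expands it before the consumer sees the document; a minimal PyYAML sketch (keys trimmed down from the pipeline above, commands illustrative):

```python
# Minimal illustration of YAML merge keys (<<:) with an anchor: the shared
# mapping is defined once, merged into both stages, and per-stage keys
# (here: vars) survive the merge untouched.
import yaml

doc = """
one:
  vars:
    - preprocessing_key: feature_one
  <<: &stage
    cmd: python src/preprocess.py
    deps: [src/preprocess.py]
two:
  vars:
    - preprocessing_key: feature_two
  <<: *stage
"""
data = yaml.safe_load(doc)
print(data["two"]["cmd"])   # the merged-in command
print(data["two"]["vars"])  # the stage's own vars survive the merge
```

Explicit keys always win over merged-in keys, which is what makes the "override just one variable per stage" pattern work.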
While removing this functionality might not completely block us from upgrading to DVC 3.x (one could copy/paste the stage definition with small changes), it would definitely increase the maintenance burden of our pipelines and make mistakes much easier (oops, forgot to change X in all the stages). It's possible I'm overlooking some functionality made possible by Hydra?
I might have even gloated about our use of this pattern in my last DVC office hours :sweat_smile:
@dberenbaum Created https://github.com/iterative/dvc/issues/9561
Thanks @efiop! Does #9561 relate to 3.0 items?
@sjawhar That's a really creative way to do things! If we were to implement #5172, do you think that would solve the same use case? From what I can tell, you are limited by needing to loop over both sessions and features.
@skshetry WDYT about the comment from @sjawhar above? What do you think would be the best way to do it?
@dberenbaum I will try to create `dvc cache/remote check` if I have time this week, and if I do, it will be nice to drop cache checks in a few places right away. If not, I think we can do that later, even though that's not ideal.
@sjawhar, that's an interesting way to do nested looping with merge keys. How many sessions do you run? If there aren't that many, I'd suggest using a (duplicated) map/dictionary and looping over that:
```yaml
# params.yaml
sessions:
  feature1_session1:
    key: feature_one
    input_dir: feature_one
  feature1_session2:
    key: feature_one
    input_dir: feature_one
  feature2_session1:
    key: feature_two
    input_dir: feature_two
  feature2_session2:
    key: feature_two
    input_dir: feature_two
```
This might be more readable, although you can still use anchors and aliases to avoid duplication:
```yaml
sessions:
  feature1_session1: &feature1
    key: feature_one
    input_dir: feature_one
  feature1_session2: *feature1
  ...
  feature2_session1: &feature2
    key: feature_two
    input_dir: feature_two
  feature2_session2: *feature2
  ...
```
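One caveat of the anchor/alias variant, shown in a minimal PyYAML sketch (session names as in the example above): an alias reuses the anchored node, so aliased session entries load as the very same dict object, not independent copies.

```python
# Aliased YAML entries (*feature1) resolve to the same in-memory mapping
# as the anchor (&feature1), so mutating one after loading mutates both.
import yaml

doc = """
sessions:
  feature1_session1: &feature1
    key: feature_one
    input_dir: feature_one
  feature1_session2: *feature1
"""
sessions = yaml.safe_load(doc)["sessions"]
print(sessions["feature1_session2"]["key"])                               # same values...
print(sessions["feature1_session1"] is sessions["feature1_session2"])     # ...same object
```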
> @sjawhar, that's an interesting way to do nested looping with merge keys. How many sessions do you run? If there aren't that many, I'd suggest using a (duplicated) map/dictionary and looping over that.
@skshetry When we use DVC in a data registry pattern, there can be hundreds or thousands of sessions. You're right, though, that we could have an additional step that runs before the `dvc` commands and generates this `params.yaml`, but supporting it natively in DVC seems more logical (to me at least).
Other release blockers
Studio Readiness for 3.0 release