iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0
13.63k stars 1.18k forks source link

option `--always-changed` of `dvc run` #2531

Closed dashohoxha closed 3 years ago

dashohoxha commented 4 years ago

There is a discussion that illustrates the usage of --always-changed with this example:

dvc run -d ... -o s3://my-bucket/folder.parquet ./create_dir.sh
dvc run -d s3://my-bucket/folder.parquet -o marker --always-changed ./generate_effective_md5.sh
dvc run -d marker -o ... ./do_something_with_dir.sh

I was thinking if we could solve the same problem with a callback stage, like this:

dvc run -d ... -o s3://my-bucket/folder.parquet ./create_dir.sh
dvc run -o marker ./generate_effective_md5.sh
dvc run -d marker -o ... ./do_something_with_dir.sh

or like this:

dvc run -o marker './create_dir.sh && ./generate_effective_md5.sh'
dvc run -d marker -o ... ./do_something_with_dir.sh

It seems like these are equivalent to the first example.

Then I realized that maybe the option --always-changed is not needed at all, since in all the cases we can convert it to a callback stage that is equivalent. Indeed, if --always-changed ignores the dependencies, then just don't specify any dependencies to the stage, making it a callback stage, and it will always be executed.

I wonder if there are any cases when an --always-changed stage cannot be converted to an equivalent callback stage.

efiop commented 4 years ago

@dashohoxha In both cases the connection between create_dir and do_something_with_dir would be lost, as generate_effective_md5 does't have dependencies listed, so dvc won't be able to build a DAG involving all of these three stages. Plus --always-changed is a much better explicit way of setting callback stages than implicit no-dep logic that we have. Please see the discussion about it at https://github.com/iterative/dvc/issues/2378 .

Closing as it seems to be resolved. Please feel free to reopen if I've missed anything.

dashohoxha commented 4 years ago

In both cases the connection between create_dir and do_something_with_dir would be lost, as generate_effective_md5 does't have dependencies listed, so dvc won't be able to build a DAG involving all of these three stages.

What about this:

dvc run -d ... -o s3://my-bucket/folder.parquet -o marker './create_dir.sh && ./generate_effective_md5.sh'
dvc run -d s3://my-bucket/folder.parquet -d marker -o ... './do_something_with_dir.sh'

I think this is equivalent to the original example (and still doesn't need --always-changed):

dvc run -d ... -o s3://my-bucket/folder.parquet ./create_dir.sh
dvc run -d s3://my-bucket/folder.parquet -o marker --always-changed ./generate_effective_md5.sh
dvc run -d marker -o ... ./do_something_with_dir.sh

Plus --always-changed is a much better explicit way of setting callback stages than implicit no-dep logic that we have.

I don't see why this is the case. It seems more like a personal preference.

Please see the discussion about it at #2378 .

This is a mind-boggling discussion for me, I am not sure I can follow it. But it seems that there is no general agreement about that issue about what to do. So, I don't know why that issue is still open, and this one was closed immediately, without any proper discussions.

Please feel free to reopen if I've missed anything.

Maybe I would have done it, but I don't have permissions to reopen issues. Please feel free to leave it closed if you feel like this is not an important issue.

efiop commented 4 years ago

What about this:

It would always run do_something_with_dir if s3://my-bucket/folder.parquet has changed (hash computed by dvc has changed), which defeats the purpose.

I don't see why this is the case. It seems more like a personal preference.

Because the definition of "considered always changed if there are no deps" is implicit as it doesn't contain clear marker, and "considered always changed if always_changed == True" is an explicit one. Not sure how this is a personal preference.

This is a mind-boggling discussion for me, I am not sure I can follow it. But it seems that there is no general agreement about that issue about what to do. So, I don't know why that issue is still open, and this one was closed immediately, without any proper discussions. Maybe I would have done it, but I don't have permissions to reopen issues. Please feel free to leave it closed if you feel like this is not an important issue.

I guess I was wrong about this one being resolved. Reopening.

dashohoxha commented 4 years ago

It would always run do_something_with_dir if s3://my-bucket/folder.parquet has changed (hash computed by dvc has changed), which defeats the purpose.

Maybe this one is better:

dvc run -d ... -o s3://my-bucket/folder.parquet -o marker './create_dir.sh && ./generate_effective_md5.sh'
dvc run -d marker -o ... './do_something_with_dir.sh'

So, marker decides whether should be done something with the dir, and marker is computed as soon as the dir is created.

The point is that with callback stages it is possible to solve all the problems that can be solved with --always-execute stages. On the other hand I don't see how callback stages can be deprecated. If the user creates a stage without dependencies, are we going to complain that this is an error? It doesn't seem reasonable to me.

Because the definition of "considered always changed if there are no deps" is implicit as it doesn't contain clear marker, and "considered always changed if always_changed == True" is an explicit one.

Maybe we have a terminology problem. If we call a stage that has no dependencies "unconditioned" (or "unrestricted", or "unconstrained", or "independent", etc.) it would feel natural to execute its command all the times.

efiop commented 4 years ago

@dashohoxha Good point, that one would work, but at the expense of mashing stages together :)

The point is that with callback stages it is possible to solve all the problems that can be solved with --always-execute stages.

With callback stages and mashing commands together, yes :)

On the other hand I don't see how callback stages can be deprecated. If the user creates a stage without dependencies, are we going to complain that this is an error? It doesn't seem reasonable to me.

In 1.0 we'll consider dropping that behavior in favor of explicit --always-changed.

Maybe we have a terminology problem. If we call a stage that has no dependencies "unconditioned" (or "unrestricted", or "unconstrained", or "independent", etc.) it would feel natural to execute its command all the times.

That is a nice way to put it! But that is definitely not something very obvious, unlike dead simple --always-changed.

dashohoxha commented 4 years ago

Good point, that one would work, but at the expense of mashing stages together :)

Also at the benefit of using what is available :)

In 1.0 we'll consider dropping that behavior in favor of explicit --always-changed.

Do what you think is best. But I am in favor of not changing callback stages, except for their name (for example calling them "unconditioned" changes).

ghost commented 4 years ago

@dashohoxha concerns are valid.

Not sure how this is a personal preference.

@efiop , I think it is, since we don't have a design philosophy for DVC (at least, not a written one).

There are people that prefer a minimal design:

Others are on the opposite side, rooting for a monolithic tool to do everything;

@efiop , I'm still not sure what are the next steps for this issue, tho; agreeing that we are not going to deprecate --always-changed and close it?

mdekstrand commented 4 years ago

The Zen of Python, "explicit is better than implicit", is IMO a good argument for --always-changed and deprecating the implicit callback behavior.

However, to the root of this issue: the proposed solutions may solve most of the problems --always-changed does solve (I am not entirely sure), but not all the problems it was designed to solve (I've added a comment to #2378 with the results of actually using --always-changed for my use case, and finding a semantic hole that doesn't seem pluggable with callbacks).

Issue #2378 is about giving DVC the ability to check the status of files that are not entirely under its control. Integrating commands together to produce the marker file assumes that the only file changes we need to be able to detect are changes that are caused by a dvc run; however, if that is not the case (e.g. I'm running the repo against a different copy of the database, that has a different status of my data processing steps), I need the ability for DVC to detect that and rerun things appropriately.

Suor commented 4 years ago

I am also for removing --always-changed I don't see any use for it outside of strange optimization, i.e. we have dep, which is slow to test for change.

BTW,

./create_dir.sh && ./generate_effective_md5.sh

is not smashing stages together. This is a single stage, separating it is an error, which led to the original issue and confusion.

This is how dvc operates, outputs glue stages together. If some stage generates non-file, i.e. updates database state or something, then we simulate output with a marker, which is effectively an output from dvc POV.

So I am for removing --always-changed, forgetting the idea of callback deps and documenting the marker approach.

Suor commented 4 years ago

Thinking a bit more about it, we should handle a situation when a non-file dep (A) might be modified by some external activity to dvc, in this case we will have a stale marker, which will no longer correspond to non-file dep state.

The solution is a callback state updating a marker each time, then any stage dependent on A may still rely on depending on the marker.

There goes an order issue. Say A might be changed both in dvc process and externally, then we will have several stages:

An update A stage:

cmd: update_A && update_marker
outs: 
    path: marker
deps: maybe some

An A status stage:

cmd: update_marker
outs: 
    path: marker

And a dependent stage:

cmd: work_with_A
deps:
    path: marker

We have several issues here, which, I think we all can ignore:

The real reqs are always update the marker (to handle outside activity) and always update marker after updating A. Both are satisfied here.

The real issue, which we can't simply handle is reproducibility though. Say we want to rerun a dependent stage on an older A state, we can't, we only do that for files. We may discuss what should (or shouldn't) we do about that.

Suor commented 4 years ago

A general takeout from all of this - we should define where dvc responsibility ends, like say non-file non-dvc managed deps. Trying to stretch our responsibility over those limits means bringing together mutually contradictory objectives and won't lead to anything good.