Closed dashohoxha closed 3 years ago
@dashohoxha In both cases the connection between create_dir
and do_something_with_dir
would be lost, as generate_effective_md5
does't have dependencies listed, so dvc won't be able to build a DAG involving all of these three stages. Plus --always-changed
is a much better explicit way of setting callback stages than implicit no-dep
logic that we have. Please see the discussion about it at https://github.com/iterative/dvc/issues/2378 .
Closing as it seems to be resolved. Please feel free to reopen if I've missed anything.
In both cases the connection between
create_dir
anddo_something_with_dir
would be lost, asgenerate_effective_md5
does't have dependencies listed, so dvc won't be able to build a DAG involving all of these three stages.
What about this:
dvc run -d ... -o s3://my-bucket/folder.parquet -o marker './create_dir.sh && ./generate_effective_md5.sh'
dvc run -d s3://my-bucket/folder.parquet -d marker -o ... './do_something_with_dir.sh'
I think this is equivalent to the original example (and still doesn't need --always-changed
):
dvc run -d ... -o s3://my-bucket/folder.parquet ./create_dir.sh
dvc run -d s3://my-bucket/folder.parquet -o marker --always-changed ./generate_effective_md5.sh
dvc run -d marker -o ... ./do_something_with_dir.sh
Plus
--always-changed
is a much better explicit way of setting callback stages than implicitno-dep
logic that we have.
I don't see why this is the case. It seems more like a personal preference.
Please see the discussion about it at #2378 .
This is a mind-boggling discussion for me, I am not sure I can follow it. But it seems that there is no general agreement about that issue about what to do. So, I don't know why that issue is still open, and this one was closed immediately, without any proper discussions.
Please feel free to reopen if I've missed anything.
Maybe I would have done it, but I don't have permissions to reopen issues. Please feel free to leave it closed if you feel like this is not an important issue.
What about this:
It would always run do_something_with_dir
if s3://my-bucket/folder.parquet
has changed (hash computed by dvc has changed), which defeats the purpose.
I don't see why this is the case. It seems more like a personal preference.
Because the definition of "considered always changed if there are no deps" is implicit as it doesn't contain clear marker, and "considered always changed if always_changed == True" is an explicit one. Not sure how this is a personal preference.
This is a mind-boggling discussion for me, I am not sure I can follow it. But it seems that there is no general agreement about that issue about what to do. So, I don't know why that issue is still open, and this one was closed immediately, without any proper discussions. Maybe I would have done it, but I don't have permissions to reopen issues. Please feel free to leave it closed if you feel like this is not an important issue.
I guess I was wrong about this one being resolved. Reopening.
It would always run
do_something_with_dir
ifs3://my-bucket/folder.parquet
has changed (hash computed by dvc has changed), which defeats the purpose.
Maybe this one is better:
dvc run -d ... -o s3://my-bucket/folder.parquet -o marker './create_dir.sh && ./generate_effective_md5.sh'
dvc run -d marker -o ... './do_something_with_dir.sh'
So, marker
decides whether should be done something with the dir, and marker
is computed as soon as the dir is created.
The point is that with callback stages it is possible to solve all the problems that can be solved with --always-execute
stages.
On the other hand I don't see how callback stages can be deprecated. If the user creates a stage without dependencies, are we going to complain that this is an error? It doesn't seem reasonable to me.
Because the definition of "considered always changed if there are no deps" is implicit as it doesn't contain clear marker, and "considered always changed if always_changed == True" is an explicit one.
Maybe we have a terminology problem. If we call a stage that has no dependencies "unconditioned" (or "unrestricted", or "unconstrained", or "independent", etc.) it would feel natural to execute its command all the times.
@dashohoxha Good point, that one would work, but at the expense of mashing stages together :)
The point is that with callback stages it is possible to solve all the problems that can be solved with --always-execute stages.
With callback stages and mashing commands together, yes :)
On the other hand I don't see how callback stages can be deprecated. If the user creates a stage without dependencies, are we going to complain that this is an error? It doesn't seem reasonable to me.
In 1.0 we'll consider dropping that behavior in favor of explicit --always-changed
.
Maybe we have a terminology problem. If we call a stage that has no dependencies "unconditioned" (or "unrestricted", or "unconstrained", or "independent", etc.) it would feel natural to execute its command all the times.
That is a nice way to put it! But that is definitely not something very obvious, unlike dead simple --always-changed.
Good point, that one would work, but at the expense of mashing stages together :)
Also at the benefit of using what is available :)
In 1.0 we'll consider dropping that behavior in favor of explicit --always-changed.
Do what you think is best. But I am in favor of not changing callback stages, except for their name (for example calling them "unconditioned" changes).
@dashohoxha concerns are valid.
Not sure how this is a personal preference.
@efiop , I think it is, since we don't have a design philosophy for DVC (at least, not a written one).
There are people that prefer a minimal design:
Others are on the opposite side, rooting for a monolithic tool to do everything;
@efiop , I'm still not sure what are the next steps for this issue, tho; agreeing that we are not going to deprecate --always-changed
and close it?
The Zen of Python, "explicit is better than implicit", is IMO a good argument for --always-changed
and deprecating the implicit callback behavior.
However, to the root of this issue: the proposed solutions may solve most of the problems --always-changed
does solve (I am not entirely sure), but not all the problems it was designed to solve (I've added a comment to #2378 with the results of actually using --always-changed
for my use case, and finding a semantic hole that doesn't seem pluggable with callbacks).
Issue #2378 is about giving DVC the ability to check the status of files that are not entirely under its control. Integrating commands together to produce the marker file assumes that the only file changes we need to be able to detect are changes that are caused by a dvc
run; however, if that is not the case (e.g. I'm running the repo against a different copy of the database, that has a different status of my data processing steps), I need the ability for DVC to detect that and rerun things appropriately.
I am also for removing --always-changed
I don't see any use for it outside of strange optimization, i.e. we have dep, which is slow to test for change.
BTW,
./create_dir.sh && ./generate_effective_md5.sh
is not smashing stages together. This is a single stage, separating it is an error, which led to the original issue and confusion.
This is how dvc operates, outputs glue stages together. If some stage generates non-file, i.e. updates database state or something, then we simulate output with a marker, which is effectively an output from dvc POV.
So I am for removing --always-changed
, forgetting the idea of callback deps and documenting the marker approach.
Thinking a bit more about it, we should handle a situation when a non-file dep (A) might be modified by some external activity to dvc, in this case we will have a stale marker, which will no longer correspond to non-file dep state.
The solution is a callback state updating a marker each time, then any stage dependent on A may still rely on depending on the marker.
There goes an order issue. Say A might be changed both in dvc process and externally, then we will have several stages:
An update A stage:
cmd: update_A && update_marker
outs:
path: marker
deps: maybe some
An A status stage:
cmd: update_marker
outs:
path: marker
And a dependent stage:
cmd: work_with_A
deps:
path: marker
We have several issues here, which, I think we all can ignore:
&&
to dedup marker creating code.The real reqs are always update the marker (to handle outside activity) and always update marker after updating A. Both are satisfied here.
The real issue, which we can't simply handle is reproducibility though. Say we want to rerun a dependent stage on an older A state, we can't, we only do that for files. We may discuss what should (or shouldn't) we do about that.
A general takeout from all of this - we should define where dvc responsibility ends, like say non-file non-dvc managed deps. Trying to stretch our responsibility over those limits means bringing together mutually contradictory objectives and won't lead to anything good.
There is a discussion that illustrates the usage of
--always-changed
with this example:I was thinking if we could solve the same problem with a callback stage, like this:
or like this:
It seems like these are equivalent to the first example.
Then I realized that maybe the option
--always-changed
is not needed at all, since in all the cases we can convert it to a callback stage that is equivalent. Indeed, if--always-changed
ignores the dependencies, then just don't specify any dependencies to the stage, making it a callback stage, and it will always be executed.I wonder if there are any cases when an
--always-changed
stage cannot be converted to an equivalent callback stage.