Add support for optional outputs in TargetCollection

riga / law

Build large-scale task workflows: luigi + job submission + remote targets + environment sandboxing using Docker/Singularity

http://law.readthedocs.io

BSD 3-Clause "New" or "Revised" License

96 stars 39 forks source link

Add support for optional outputs in TargetCollection #139

Closed lmoureaux closed 1 year ago

lmoureaux commented 1 year ago

Workflow outputs are wrapped in a TargetCollection. When some of them are optional and missing, the workflow correctly reports that it's done if all non-optional outputs are present. However, trying to depend on such a workflow doesn't work: Luigi checks .exists() before it runs the dependent task, and dies with "initially missing tasks outputs". To avoid this issue, .exists() must ignore optional outputs as is done in this patch.

Ran into this when moving the JSON control files for a workflow, which are set as optional outputs by Law itself. (See #126.)

Didn't touch the file-based collections because I'm not sure how the threshold parameter should be handled (which sounds more likely to be used than with a mere heterogenous TargetCollection).

riga commented 1 year ago

Thanks @lmoureaux

This would indeed be nice (and consistent) to have! Checking this out now.

Btw, for these situations I added the check_unfulfilled_deps option to the luigi worker config.

lmoureaux commented 1 year ago

Btw, for these situations I added the check_unfulfilled_deps option to the luigi worker config.

Yes, I discovered this while debugging, but better fix the bug than work around it ^^

riga commented 1 year ago

I was able to look into it right away and noticed one thing. The decision of what to do when a target is optional should, for instance, still be made as task level (in the complete() check), rather than making it the default in the methods of the collection classes.

As an example, iter_existing should not skip optional targets by default, but the decision on how to treat optionals should be made elsewhere.

I create a PR (https://github.com/lmoureaux/law/pull/1) with the changes required to your branch so that it would be part of this PR. There I also propagated the new behavior to the file based collections, where it's quite complicated to make this fully efficient.

Could you give it a spin? If things work for you, we can merge :)

lmoureaux commented 1 year ago

The decision of what to do when a target is optional should, for instance, still be made as task level (in the complete() check), rather than making it the default in the methods of the collection classes.

Makes sense. Then one needs to update the workflow classes to consider the optional targets they inject (and only those!) as really optional. This is the primary issue I'm trying to solve here, not a full refactor of the collections :D

I think a task that wants custom exists behavior can simply subclass TargetCollection if it doesn't like the default behavior. TargetCollection should provide sane defaults, not try to be a catch-all (it's already too big with the threshold stuff).

As an example, iter_existing should not skip optional targets by default, but the decision on how to treat optionals should be made elsewhere.

Maybe the following definitions would be useful if you want to refactor:

Existing sub-targets are the ones for which exists() returns True.
Non-existing sub-targets are the ones for which exists() returns False.
Missing sub-targets are the non-existing sub-targets needed for the collection's exists() to return True. Without threshold, exists() is equivalent to not list(collection.iter_missing()).

Especially the distinction between missing and non-existing is new. Since it's a semantic difference, I think it should use a different function and not an optional parameter.

riga commented 1 year ago

Then one needs to update the workflow classes to consider the optional targets they inject (and only those!)

Do you mean those?

https://github.com/riga/law/blob/91af3fb9e64d631351585cc820f92915cd055117/law/workflow/remote.py#L482-L489

This is the primary issue I'm trying to solve here, not a full refactor of the collections :D

Sorry, I got a bit carried away when adding the changes on top :) I saw that they kinda had to be done and I want to make sure that this will still work for you.

Especially the distinction between missing and non-existing is new.

You're right, there would be a distinction between the two, although I would have to think again about implications (and perhaps the naming, because I believe one has to pay attention as a user when reading sth about "missing != non-existing :) ). Maybe sth for another PR?

lmoureaux commented 1 year ago

Then one needs to update the workflow classes to consider the optional targets they inject (and only those!)

Do you mean those?

https://github.com/riga/law/blob/91af3fb9e64d631351585cc820f92915cd055117/law/workflow/remote.py#L482-L489

Yes :) I'll see if I can update my branch to update these instead. Will also come up with simple repro steps. Is there a way to write unit tests?

This is the primary issue I'm trying to solve here, not a full refactor of the collections :D

Sorry, I got a bit carried away when adding the changes on top :) I saw that they kinda had to be done and I want to make sure that this will still work for you.

Hehe I can see that :) I'm not sure a refactor needs to be done. Optional targets can be explicitly unsupported by collections as long as Law doesn't use them this way.

Especially the distinction between missing and non-existing is new.

You're right, there would be a distinction between the two, although I would have to think again about implications (and perhaps the naming, because I believe one has to pay attention as a user when reading sth about "missing != non-existing :) ). Maybe sth for another PR?

I'd say something for an issue to discuss, then a PR to implement.

riga commented 1 year ago

I did a small-ish refactoring and updated https://github.com/lmoureaux/law/pull/1.

The main changes are:

Targets have a complete() now. It is similar to exists() but is meant to include other info and settings such as self.optional. Since it's being used from within the task's complete() method, I think complete is a good name here, too.
The exists(), count(), iter_missing() and iter_existing() methods in task collections are re-use a single internal method _iter_state() which implements the logic once. It has a flag optional_existing which can be
- None → no special treatment of optionals,
- True → treat optionals as existing, and
- False → treat optionals as missing.
With that, _iter_state() is flexible and yet optimized enough to serve as the basis for the four methods listed above.
Since the four methods forward all their additional arguments, all sub-class implementations could be (almost) fully removed, which heavily reduces the amount of code to be maintained.

I'm giving it a thorough testing later today, but wanted to share this with you early on. There are no full unit tests right now, but this is certainly planned for mid this year (but in principle they could already be added to tests/test_XYZ.py).