pyinvoke / invoke

Pythonic task management & command execution.
http://pyinvoke.org
BSD 2-Clause "Simplified" License

Advanced task dependencies / caching #461

Open bitprophet opened 7 years ago

bitprophet commented 7 years ago

Use case description

Core desire is to strengthen relationships between tasks above/beyond what the existing pre/post functionality achieves; and to structure a nontrivial set of dependent state as a graph of tasks which only execute as many times as is required to set up that state.

Specific example is an existing nontrivial task tree around performing cloud automation, configuration management, and similar things. Many/most tasks in this tree are currently wrapped with a @state decorator whose body does a large amount of "heavy lifting":

At the least, this decorator "wants" to be split up into a bunch of smaller parts, each of which is connected by what state it depends on, with the outermost/topmost caller (i.e. the decorated CLI task) specifying more precisely which bits it needs. (This already exists in part with some args passed into the decorator, e.g. @state(limited=True) skips everything but the first few config bits and an early API client.)

Those smaller parts would then be capable of memoizing their results, so that e.g. if they are declared to "generate a value for config path c.aws.clients.vpc", and that config path already exists/is non-empty when something starts calling that function, it simply records this fact and short-circuits.
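A minimal sketch of that short-circuit, assuming a plain nested dict stands in for the config object. The `provides` decorator and the `get_path`/`set_path` helpers are hypothetical names for illustration, not part of Invoke:

```python
from functools import wraps

def get_path(config, dotted):
    """Return the value at a dotted path, or None if any segment is missing."""
    node = config
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def set_path(config, dotted, value):
    """Store value at a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = dotted.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

def provides(dotted):
    """Hypothetical decorator: skip the body when the config path is already filled."""
    def decorator(func):
        @wraps(func)
        def wrapper(config):
            existing = get_path(config, dotted)
            if existing:  # already present and non-empty: short-circuit
                return existing
            value = func(config)
            set_path(config, dotted, value)
            return value
        return wrapper
    return decorator

calls = []

@provides("aws.clients.vpc")
def make_vpc_client(config):
    calls.append("make_vpc_client")
    return {"kind": "vpc-client"}  # stand-in for a real API client

config = {}
make_vpc_client(config)
make_vpc_client(config)  # second call finds the value and does no work
```

Here the "is it already done?" test lives in the lower-level function, so callers never need to know whether the value was freshly generated or reused.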

Going deeper?

Presumably one could expand this concept all the way to abstract things like "task A requires an admin cloud instance and a database" - including some sort of maybe registry-driven "find me some task that satisfies needs XYZ, cuz I need X Y and Z" setup. Which feels fraught and complex to me, and is probably a distinct API feel from the "do exactly what I say but feel free to skip anything you already did" described above.

Another approach could be something like @requires(file='/path/to/artifact', satisfied_by=build) where build is a task that presumably generates /path/to/artifact; this then becomes "just" a more rigorous @task(pre=[build]), and places the burden of the "is it already done?" test on the higher-level task instead of the lower-level one. Again, different API feel; and not the best, because if N tasks need /path/to/artifact they all have to specify this; and satisfied_by= implies that maybe M subtasks could generate the artifact, which feels specious.
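To make the tradeoff concrete, here is a rough sketch of how such a `@requires` could behave. It is purely illustrative: `requires` is the hypothetical decorator floated above, not an existing Invoke API, and a temporary directory replaces /path/to/artifact so the example touches no real paths.

```python
import os
import tempfile
from functools import wraps

# Throwaway location so the sketch does not touch real paths.
ARTIFACT = os.path.join(tempfile.mkdtemp(), "artifact.txt")
builds = []

def build():
    """Stand-in for a task that generates the artifact."""
    builds.append("build")
    with open(ARTIFACT, "w") as handle:
        handle.write("built\n")

def requires(file, satisfied_by):
    """Hypothetical decorator: run satisfied_by() only when file is missing."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not os.path.exists(file):
                satisfied_by()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@requires(file=ARTIFACT, satisfied_by=build)
def package():
    with open(ARTIFACT) as handle:
        return handle.read().strip()

@requires(file=ARTIFACT, satisfied_by=build)
def deploy():
    return os.path.exists(ARTIFACT)

first = package()   # artifact missing, so build() runs first
second = deploy()   # artifact now exists, so build() is skipped
```

Note that both `package` and `deploy` have to repeat the same file/satisfied_by pair, which is the duplication complained about above.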

Basically, each time I think of this side of it, I come back to: no, I really just want to specify that build_my_env depends on some half dozen other actual tasks, each of which wants to be memoizing or otherwise skipping already performed work. That gets us a very make-like setup, especially when you consider make is effectively just our existing pre-task functionality, except with the wrinkle that file paths "are" task names.
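As a sketch of that make-like behavior, assuming nothing about Invoke internals: the `run` helper and the task names other than build_my_env are made up for illustration, and the dependency mapping is a plain dict. Each task runs after its dependencies, and at most once per session. There is no cycle detection here.

```python
log = []

def config():
    log.append("config")

def api_client():
    log.append("api_client")

def vpc():
    log.append("vpc")

def database():
    log.append("database")

def build_my_env():
    log.append("build_my_env")

# Hypothetical dependency table: task -> tasks that must run before it.
dependencies = {
    api_client: [config],
    vpc: [api_client],
    database: [api_client],
    build_my_env: [vpc, database],
}

def run(task, dependencies, done):
    """Run task after its dependencies, skipping anything already in done."""
    if task in done:
        return
    for dep in dependencies.get(task, ()):
        run(dep, dependencies, done)
    task()
    done.append(task)

done = []
run(build_my_env, dependencies, done)
```

Even though both vpc and database depend on api_client, it (and config) only shows up once in `log`.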

Solutions brainstorm

User-facing API

Note: was originally pondering additional decorators, but now think additional task kwargs makes more sense, see below comments. The name brainstorm is largely the same either way.

The actual needs are:

Specifying dependencies/prerequisites

Specifying followups/postrequisites

Skipping-execution checks

Jeff's thoughts

The hardest part by far seems to be the plural noun for tasks which come after the current one, so we should start there.

The only ones that have come up so far that aren't problematic are: "after-tasks", "consequences", "post-tasks", "followups", "postrequisites", "subsequents", "successors". None of these are immediate "yeah!"s so let's see how they stack up re: the criteria listed up top.

Implementation


Related, possibly subsumed tickets:

bitprophet commented 7 years ago

More thoughts...

Call caching/tracking

Any non-empirical task call tracking should be double-checked for eventual inter-process parallelization safety as well as intra-process parameterization safety:

Decorator API

I'm now rethinking the "separate decorator" API option, for a couple reasons:

bitprophet commented 7 years ago

Re: how to track the state: probably a good time to look at an actual graph lib. The one that Bruce identified and updated to be Python 3 compatible over in #45 is a good place to start, it's called dagger. Poking it briefly (including a skim of its docs):

bitprophet commented 7 years ago

So, after a source code skim, dagger is one of those classic "is it even worth saving myself <=375 SLOC if I am gonna feel compelled to tweak and/or subclass a bunch of stuff" gray area cases. Might as well use it to prototype for now but I reserve the right to just write an "inspired by" recreation if I start running into too much trouble.

bitprophet commented 7 years ago

Been wondering during all this if we can reasonably do away with post-tasks. They're a very minor use case compared to their inverse, have difficult naming, and don't fit neatly into the idea of a DAG: they're not part of the DAG at all but are part of the "body" of the top-level/being-invoked task, practically speaking. (And you can't phrase them as inverse dependencies, either, since the whole point is that they are not the focus of the execution.)

Further, most use cases for post-tasks seem like they hinge on success vs failure, which combined with the above, feels like they "should" be treated as part of #170 (tasks calling other tasks) and wrapped within try/except/finally blocks.


I scanned the tracker to see how many folks are filing tickets about post (vs pre-only) tasks and did find #298, so at least some users are using them - tho that is the only ticket I found.

It raises some things not noted prior:

The old deduplication was generally confusing re: how it treated post-tasks. tl;dr when post-tasks would show up multiple times, only the first one was kept, instead of only the last.

Corollary: even in a DAG, deduplication may want to be its own thing, because "lazy" or "elastic" pre/post tasks (what 'dependencies' generally mean - "run me at least once per session, anytime before/after the declaring task") are distinct from "rigid" ones (i.e. "run me once per declaring task" and typically also "run me immediately before/after the declaring task").

That said, I think it can still be argued that "rigid" pre/post tasks can/should be phrased within the body of the main task and don't benefit from the outer dependency/dedupe system (in fact they don't benefit from and significantly complicate such a system!)

bitprophet commented 7 years ago

If we assume we're gonna handle post-tasks, kind of want something that more closely matches the term dependency/depends/requires. Brainstorm:

indera commented 7 years ago

Travis uses after_success or after_failure ... https://docs.travis-ci.com/user/customizing-the-build

bitprophet commented 7 years ago

@indera Indeed, and the thought occurred to me, though I suspect in most cases folks will desire to handle the after_failure case 'internally' with try/except/finally type logic as opposed to reimplementing that functionality within the execution system. We'll see - it's certainly possible to add the success/failure split later.

bitprophet commented 7 years ago

A TODO for myself to get this off the ground and stop dithering:

bitprophet commented 7 years ago

Grump, I was all set on @task(triggers=[call, me, after]) but realized it's also ambiguous - I meant it as "Calling me triggers calls to these other tasks", but it could be read as a plural of "trigger", implying "other tasks which, when called, trigger a call to me".

Maybe afterwards is best for now after all? Though it (& the rest I brainstormed above) lacks a useful descriptive word / collective noun to go along with the keyword... I should do some Twitter polls or something :)

EDIT: going with "followup tasks" and @task(followups=[...]) for the time being.

bitprophet commented 7 years ago

More name ideas stemming from a (side) twitter discussion:

offbyone commented 7 years ago

dependencies/consumers is more explicit.

@task(and_then=CONSUMERS) works, as a documentation slug and name thingy.

(As an aside, the tendency of kwargs' names and storage to share a name in languages that support them really makes this kind of API a pain. Objective-C does this a lot better.)

bitprophet commented 7 years ago

consumers isn't necessarily generic enough, though; the example use cases that come up a lot are things like cleaning artifacts, notifying external services, etc. They aren't necessarily consuming something produced by the 'main' task (as opposed to e.g. a link-compile or static-asset pipeline where "consumer" is definitely appropriate.)

(One could make the argument that e.g. a notification followup task is "consuming" the [empty] output of e.g. a test or build task, but that feels like a stretch to me.)

dependencies seems like an obvious slam dunk either way though. I'm even contemplating adding an alias or two for it, e.g. @task(requires=[other, tasks]), but so far I've tried to shy away from having too many "convenient" aliases.

ask commented 7 years ago

Radical, but if it was like this you don't even have to define functions elsewhere:


@task()
def compile():
    ...

@compile.before
def check_versions():
    ...

@compile.after
def cleanup_files():
    ...

bitprophet commented 7 years ago

What if I want check_versions to run before some number of other tasks, instead of just @compile, though? :grin: Then we end up with this:

@task
def compile(): ...

@task
def build(): ...

@task
def dryrun(): ...

# ...

@compile.before
@build.before
@dryrun.before
# ...
def check_versions(): ...

ask commented 7 years ago

Yeah, I guess that's out of the question if these are supposed to build a tree of tasks. You can call other tasks in these, but it will be impossible to introspect what the dependencies are. The good news is I may make use of this pattern some time :)

bitprophet commented 7 years ago

Also just not sure how I feel about task objects being usable as decorators; it definitely makes for a neat-looking API, but I worry it goes too far into the magic zone with not enough benefit. Just a gut feeling though. (EDIT: but yea, I can definitely see other use cases where the benefits do outweigh the drawbacks, so, good luck :D)

ask commented 7 years ago

@bitprophet The task objects are not decorators, that'd be the composite @task.after etc.

from typing import Callable

class Callbacks(list):
    # A plain list of callbacks that can also be applied as a decorator.
    def __call__(self, fun: Callable) -> Callable:
        self.append(fun)
        return fun

class Task:
    def __init__(self) -> None:
        self.before = Callbacks()
        self.after = Callbacks()

I wouldn't call it magic exactly, implementation is simple, and it's used in the stdlib with @property:

class X:
    _foo = None
    @property
    def foo(self):
        return self._foo

    @foo.setter
    def foo(self, value):
        self._foo = value

    @foo.deleter
    def foo(self):
        print('OOPS')

(Alas, I cannot argue that @property is good use of it.)

bitprophet commented 7 years ago

Ah right, good point. (Sorry, bouncing all over the place right now so even more scatterbrained than usual.)

My other point still stands unfortunately, I think it makes more sense in the general case for the declarations to live in/on the tasks making them instead of vice versa. This overall problem space, of course, often sees solutions going both ways (see eg Chef resources' notifies/subscribes) but I think I'd prefer to implement the more commonly useful variant first, and leave the option on the table to add the inverse later if enough people seem to want it.

bitprophet commented 7 years ago

Note to self: while writing out multiple examples using @task(followups=[a,b,c]), suspect it should really be @task(followup=[a,b,c]) instead.

I also still don't hate @task(afterwards=[a,b,c]), while still probably referring to a, b and c as "followup tasks"?


Also noting FTR that Twitter has yielded more "literate" style ideas for the kwargs, e.g.:

My gut says these are slightly too cutesy, but you never know.

offbyone commented 7 years ago

Don't undersell it; there's definitely room for a sense of play in an API if the API is still usable. Especially if the literate approach yields clarity.

bitprophet commented 7 years ago

First pass at DDD is done, Github rendering of it is here: https://github.com/pyinvoke/invoke/blob/34c71cad54508579698bc5200dfaf3e65dd32eb5/sites/docs/concepts/execution.rst

Still says "followups" for now, I'll figure out what the final terminology should be after I actually prove an implementation works...

bitprophet commented 7 years ago

Each time I touch this stuff I find the API design parts irritating. Currently torn on whether to aim for multiple kwargs for the singular/plural case, or a single kwarg that behaves in a polymorphic fashion (accepts one object or an iterable of them).

E.g. @task(dependencies=[a,b,c]) + @task(depends_on=a), or...just @task(depends_on=[a,b,c]) + @task(depends_on=a). (I.e. in English, one can "depend on" a singular or plural noun, so...why not just go with that? Many more options here work in the "either-or" case than are purely singular or purely plural.)

Also still torn between depends_on and requires. (Either could work as a polymorphic kwarg.)

thebjorn commented 7 years ago

depends_on and requires read much better than dependencies (which is too long and has too many syllables).

Having an easy type signature will let IDEs help programmers spot errors, and it would prevent gymnastics if one of the required pre-requisites is an iterable, so I would urge always using a list, i.e. @task(requires=[a]).

afterwards doesn't seem too cutesy (or maybe on_success/on_error..? -- I don't recall if post_tasks run regardless of task success/failure).

bitprophet commented 7 years ago

Agree that shorter is better.

While I recognize your assertions about signature/gymnastics, I'm eternally on the fence about "always a list" because it's an annoying chore in the very common case of only ever wanting to throw a single value into it, and what's this sort of code for if not removing annoying chores? :grin: But again, I recognize the issues with polymorphism (or w/e it's properly called) which is why I wonder if there are any useful terms that strongly imply only the singular. Can't think of any really.

Then again - because we already do some mild gymnastics for the positional-argument use case, that makes 'single dependency' trivially easy (@task(my_dependency)) so perhaps it's moot. Not sure; explicit kwargs read nicely even in the trivial case.

Re: success vs error, my guiding principle right now is "if a given logic doesn't require the dependency system to achieve, it should not be implemented in that system at all", and I can't think of many scenarios where an on_error makes more sense than some in-task try/except/else/finally construct. Perhaps a "notify on fail", but even that is easily accomplished with a try/except that sets some config state, plus a regular followup task that interprets said state to figure out how/whether to notify. (And a "naive" notification task that isn't relying on any sort of state-passing, doesn't seem like it's very useful.)
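A rough sketch of that "notify on fail" pattern, assuming a plain dict stands in for config state and a trivial `execute` helper stands in for the followup machinery (both are illustrative, not Invoke's real API). For simplicity the task swallows the exception after recording it:

```python
def run_tests(config):
    """Main task: handles its own failure and records it in shared state."""
    try:
        raise RuntimeError("2 tests failed")  # stand-in for a real failing step
    except RuntimeError as exc:
        config["last_failure"] = str(exc)

def notify(config):
    """Regular followup: reads the state to decide whether to notify."""
    failure = config.get("last_failure")
    if failure is None:
        return None  # nothing went wrong, stay quiet
    message = f"build failed: {failure}"
    config.setdefault("sent", []).append(message)  # stand-in for a real notifier
    return message

def execute(task, followups, config):
    """Hypothetical runner: call the task, then each followup in order."""
    task(config)
    return [followup(config) for followup in followups]

config = {}
results = execute(run_tests, [notify], config)
```

The success/failure branching stays inside ordinary Python, and the execution system only has to know "run notify afterwards".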

"Normal" dependencies & followups make sense because it's not possible (certainly not easy) to achieve dependency deduplication or followup deferment without a call graph system; but anything where your logic dictates you want something to always happen, and immediately before/after the main task, seems like it should by rights live within the task body.

As always, though, this is an incremental change and I'm open to future changes/expansions.

bitprophet commented 7 years ago

Hrm. How about enables for followup tasks (@task(requires=[clean], enables=[notify]))? Not great, since "you can do X after you run me" is very different from "you must do X after you run me". But it's another one for the brainstorm pile, and it seems to match up well with requires at least.

Also, follow_with or followed_by might work, they're in the 3-syllable camp, are highly unambiguous in terms of subject/object, and feel less awkward than followups=.

offbyone commented 7 years ago

enables is an ordering constraint, but doesn't state that they will be executed after. It'd be useful for a statement that those tasks must follow this one, but not that they will.

bitprophet commented 7 years ago

Isn't that what I said? :D

bitprophet commented 7 years ago

Moar: while they feel too mouthy/awkward, specifying dependencies with preceded_by and followups with succeeded_by at least has the property of being symmetrical:

@task(preceded_by=[clean], succeeded_by=[notify])

Yea...definitely awkward to type.

Alternately, flip it around and specify dependencies with succeeds and followups with precedes? Bit easier to type (if still having the double-c and double-e. boy we're in the weeds now aren't we?)

@task(succeeds=[clean], precedes=[notify])

bitprophet commented 7 years ago

Spent the time to gather up all of the stupid words we've all brainstormed, and my personal thoughts/observations on 'em, and put them in the description in alpha order for shits n giggles. Please ping me if I missed your favorite 😛 EDIT: also tried to sprinkle in the plural-noun and verb angles where appropriate.

bitprophet commented 7 years ago

The more I stare at and/or enhance these lists the more I feel like a) the main limiter is the "tasks after this task" plural noun, everything else has half decent options, and b) the least-awful of those is still "followups". Further, I like the kwarg afterwards (esp vs followups) enough that I think it's worth the slight overhead of not being exactly the same as the plural noun.

So I guess I'll keep rolling with that for now, and also keep referring to the overall system as "dependencies" or "the dependency system" since I still suspect dependencies will always be the focus by far. Referring to everything as "dependencies/followups" feels needlessly strict.

EDIT: also going with just-takes-iterables for both kwargs for the time being. Can always add in the bit of handwaving required for single objects later if it really bugs me 😜

bitprophet commented 5 years ago

Taking a stab at resurrecting this in a post-1.0 world. Sadly it means some of the "clean" changes now have to contend with backwards compatibility, though I think that only really means keeping some arg/flag aliases around.

My old branch is nowhere near a clean merge due to a lot of the cleanup, file renaming & file consolidation that happened in the last year and a quarter (uggh) so I'm gonna have to do a lot of copy-pasting into a new branch or something. What needs doing:

Final TODO:

bitprophet commented 5 years ago

In re-reading the docs I'm finding the hard requirement of an iterable value for depends_on/afterwards a little annoying/weird, especially since the singular check exists. Don't see that it'll make life that much harder to allow the former two to be iterable-or-callable; esp since they don't (currently) accept strings (though we probably want to pre-emptively guard against that anyways).
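For reference, the iterable-or-callable handling could look roughly like this. It is a sketch only; `as_task_list` is a hypothetical helper name, and the string guard is the pre-emptive check mentioned above:

```python
from collections.abc import Iterable

def as_task_list(value):
    """Normalize one callable, or an iterable of callables, into a list."""
    if value is None:
        return []
    # Strings are iterable, so reject them before the generic iterable branch.
    if isinstance(value, (str, bytes)):
        raise TypeError("expected a task or an iterable of tasks, not a string")
    # Checked before iterability, so a callable that is also iterable
    # counts as a single task.
    if callable(value):
        return [value]
    if isinstance(value, Iterable):
        tasks = list(value)
        for item in tasks:
            if not callable(item):
                raise TypeError(f"not a task: {item!r}")
        return tasks
    raise TypeError(f"expected a task or an iterable of tasks, got {value!r}")

def clean():
    pass

def build():
    pass

single = as_task_list(clean)             # depends_on=clean
several = as_task_list((clean, build))   # depends_on=(clean, build)
```

The one real ambiguity is an object that is both callable and iterable; this sketch resolves it in favor of "single task".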

Kinda wish check/checks had a nicer single name we could do the same for, since then it feels different from its siblings. Might have to double check my old brainstorm. EDIT: those options are all bleh. New ideas: skip_if, guard/guards (same problem as check/checks tho), requires (too easy to confuse with the actual dependencies).

Alternately, remove 'checks' entirely in favor of some method of transmitting a "full stop" from a dependency; except that feels too convoluted, especially in the sense that we may well have a tree of tasks where we only want some subtrees to skip execution, not the entire session. Also may put too much control into the dependency instead of the dependent.

Could just say checks by itself, no check, since you could interpret it as "this task checks to see if it needs to run, via this callable/these callables" and then it works for singular or plural. But since it's also easily read as a plural of check, not sure that really works. Having the two different kwargs may just be a necessary evil. EDIT: what about just check? "Check this thing" or "Check these things"?

EDIT AGAIN ROFLMAO: actually, having an iterable of checks has its own problem: must all of the checks yield False to stop exec, or only one or more? There are going to be use cases for both options. Plus, users can work around it relatively easily by just having one check which calls N other checks; and we can add an iterable checks later after thinking on it harder, without breaking backwards compat.
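A sketch of that workaround, under the assumption (taken from the wording above) that a check returning False means "stop, don't execute". All names here are illustrative; the point is that the any-vs-all decision lives in user code rather than in the dependency system:

```python
def artifact_missing():
    return True   # this check says the task still needs to run

def cache_stale():
    return False  # this check says the work is already done

def run_if_any():
    """Composite check: run when at least one sub-check asks for it."""
    return any(check() for check in (artifact_missing, cache_stale))

def run_if_all():
    """Composite check: run only when every sub-check asks for it."""
    return all(check() for check in (artifact_missing, cache_stale))

ran = []

def build():
    ran.append("build")

def maybe_run(task, check):
    """Hypothetical single-check gate: skip the task when check() is False."""
    if not check():
        return "skipped"
    task()
    return "ran"

loose = maybe_run(build, run_if_any)
strict = maybe_run(build, run_if_all)
```

With a single `check` kwarg, both policies are expressible today, and an iterable `checks` can still be added later without breaking anything.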

pombredanne commented 4 years ago

ping :) @bitprophet is this dead dead or... worthy of a resurrection?

mohnen commented 4 years ago

I know that I am late to the show, but have a look at pyinvokedepends