taskcluster / taskgraph

Generates task dependency graphs for Taskcluster CI
Mozilla Public License 2.0
16 stars 42 forks source link

Support custom hooks between `full_task_set` and `full_task_graph` phases #424

Open eu9ene opened 9 months ago

eu9ene commented 9 months ago

Would it be possible to introduce a hook similar to transforms but for the full graph to run it before adding the edges? This would allow connecting the graph dynamically based on parameters. This is something that can be done in individual transforms by modifying dependencies but having it in one place would simplify the code significantly. I'm thinking it could be called here. It would also allow not specifying dependencies for some tasks in yaml, so they become optional unless included in the graph.

We have a pretty complex use case for graph modifications based on parameters, here is the full context.

ahal commented 9 months ago

So basically you want a custom hook that takes the parameters and the full set of generated tasks as input? I don't love making the generation process even more complicated, but I guess we could do that..

A possible (very hacky) way to accomplish this today might be to create a special kind that depends on every other kind in its kind-dependencies. Then it would have access to the full graph via config.kind_dependencies_tasks in a transform. Once done it could simply not yield anything. This is admittedly pretty terrible, but could be a way to try this out before we commit to adding a feature like this upstream.

(Note your link is pointing to the end of the "kind graph" rather than the task graph. I think you meant to link around here: https://github.com/taskcluster/taskgraph/blob/85c360e3e2adf94d64a14988f560f6d591509483/src/taskgraph/generator.py#L326)

bhearsum commented 9 months ago

Just to restate what I think I'm reading: It sounds like the idea here is that we'd generate task definitions in one stage (but explicitly not any dependencies, fetches, or anything else that requires information from other tasks), and then a subsequent stage would deal with all of the connections.

That's obviously a valid way to generate a task graph, but it's almost an entirely different thing than what Taskgraph does, and I'm not even sure it plays well with how transforms work at the moment. For example, it would necessarily have to ignore kind-dependencies, and I'm not sure it fits in with many other taskgraph features. For example, there's lots of logic in well used transforms around automatically setting dependencies for fetches, docker image tasks, etc. Additionally, cached_tasks also relies on dependencies to function properly.

I don't want to say that it can't or even shouldn't be done with Taskgraph, but if/when we support this formally I think we need to spend some time thinking about how it will work and interact with other parts of Taskgraph.

ahal commented 9 months ago

I don't think it would necessarily need to ignore dependencies / kind-dependencies..

I think this could be as simple as moving this for loop: https://github.com/taskcluster/taskgraph/blob/85c360e3e2adf94d64a14988f560f6d591509483/src/taskgraph/generator.py#L329

Into a standalone function. Then we can cover our eyes and say "I can't see you!" as firefox-translations monkeypatches it to do whatever they want :p

For example, there's lots of logic in well used transforms around automatically setting dependencies for fetches, docker image tasks, etc. Additionally, cached_tasks also relies on dependencies to function properly.

Actually my hacky special kind suggestion from my previous comment would play nicely with that.. just omit any kinds you don't want to post process from the special kind's kind-dependencies.

ahal commented 9 months ago

That said if there are existing transforms that accomplish what we need already, my first choice would be to make those even better. Hide away some of the complexity, make them more generic, simplify the task interface etc

eu9ene commented 9 months ago

I like the idea of giving this approach a try using a transform that can access all the tasks or a simple monkey-patching of a function first.

I actually meant this place in the code, where it takes kind dependencies and builds Graph:

https://github.com/taskcluster/taskgraph/blob/85c360e3e2adf94d64a14988f560f6d591509483/src/taskgraph/generator.py#L294

We could modify them before that. The idea is some dependencies might not be specified in the yaml at all and populated only by the hook. I'm not sure if it's possible not to connect kinds and connect only tasks or connect both on for t in full_task_set: step.

bhearsum commented 9 months ago

I like the idea of giving this approach a try using a transform that can access all the tasks or a simple monkey-patching of a function first.

I actually meant this place in the code, where it takes kind dependencies and builds Graph:

https://github.com/taskcluster/taskgraph/blob/85c360e3e2adf94d64a14988f560f6d591509483/src/taskgraph/generator.py#L294

We could modify them before that. The idea is some dependencies might not be specified in the yaml at all and populated only by the hook. I'm not sure if it's possible not to connect kinds and connect only tasks or connect both on for t in full_task_set: step.

This is unlikely to be a good place. At this point the only think we have is a graph of the relationships between the kinds. At this point we've read the raw yaml from the kinds, but we haven't run the transforms. That happens in load_tasks, and as @ahal mentioned earlier, after this is probably the best place to do any fiddling. I can help you get set up to experiment with this, if you like.

bhearsum commented 9 months ago

For example, there's lots of logic in well used transforms around automatically setting dependencies for fetches, docker image tasks, etc. Additionally, cached_tasks also relies on dependencies to function properly.

Actually my hacky special kind suggestion from my previous comment would play nicely with that.. just omit any kinds you don't want to post process from the special kind's kind-dependencies.

This is true, but I guess my point is more that you could easily end up with situations where you have tasks from kind B which uses fetches from kind A, and then kind C uses this new hook, and removes or alters one or more of those fetches, leading to confusion if you're only looking at kinds A & B. I guess the general point is taskgraph is built in a way where kinds generally deal with all of their own data, and you can usually understand how a task is made by looking at a kind and its transforms. This change would mean you can't make any assumptions like that any more. (To be fair, optimizations and morphs also do this sort of graph level modification, but they are used in very specific ways, and explicitly are not supposed to alter the meaning of the graph.)

To be perfectly clear: I'm not at all against the idea of a system where task definitions and task graph generation are segregated from one another - I'm just concerned that making it a very official part of taskgraph will lead to loads of confusion down the road.

A generic "post task loading hook" (which I think is what you're suggesting @ahal?) is probably okay enough.

bhearsum commented 8 months ago

A generic "post task loading hook" (which I think is what you're suggesting @ahal?) is probably okay enough.

We talked about this a bit this week, and indeed, this is one of the ideas being suggested. I have no issues with this fairly generic approach.