Get Dependency Graph and Task Config for each Task

HolzmanoLagrene commented 1 year ago

I have two feature requests for the API

Allow to fetch a graph for how the Jobs and Evidences are connected.

At the moment I have no clue which Jobs will actually get started if I process an evidence with a set of Jobs. It would be really cool to know beforehand which Jobs will get triggered based on the Evidence-Type and the initial Jobs selected.

For now I do this with some sort of twisted reflection and build a graph to get an impression of what is going to happen:

import inspect

import pathpy as pp  # assumption: "pp" is pathpy; the original snippet omitted its imports
from turbinia.jobs.interface import TurbiniaJob
from turbinia.workers import TurbiniaTask

# For every Job, collect its input/output Evidence types and the Tasks
# (with their TASK_CONFIG) that are defined or imported in the Job's module.
result = {}
for subclass in TurbiniaJob.__subclasses__():
    evidence_in = []
    evidence_out = []
    tasks = {}
    for name, class_ in inspect.getmembers(inspect.getmodule(subclass), inspect.isclass):
        class_hierarchy = inspect.getmro(class_)
        if TurbiniaJob in class_hierarchy:
            evidence_in += [a.__name__ for a in class_.evidence_input]
            evidence_out += [a.__name__ for a in class_.evidence_output]
        elif TurbiniaTask in class_hierarchy:
            tasks[class_.__name__] = class_.TASK_CONFIG
    result[subclass.__name__] = {"evidence_in": evidence_in, "evidence_out": evidence_out, "tasks": tasks}

# Build a directed graph: Evidence -> Job -> Evidence, plus Job -> Task.
n = pp.Network(directed=True)
for jobname, data in result.items():
    n.add_node(jobname, type="job")
    for ev_out in data["evidence_out"]:
        n.add_node(ev_out, type="evidence")
        n.add_edge(jobname, ev_out)
    for ev_in in data["evidence_in"]:
        n.add_node(ev_in, type="evidence")
        n.add_edge(ev_in, jobname)
    for taskname, config in data["tasks"].items():
        n.add_node(taskname, type="task", config=config)
        n.add_edge(jobname, taskname)

A visual representation looks something like this:

Allow to fetch the Task Config for each Task

Since it is already possible to fetch the required and optional parameters for each evidence type, it would be amazing to also be able to fetch the possible task parameters for each Task.
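As a rough illustration of what "task parameters for each Task" could look like, here is a minimal sketch that walks a task class hierarchy and collects each class's `TASK_CONFIG`. The classes below are stand-ins invented for the example (in Turbinia the base would presumably be `TurbiniaTask`), and the config keys are illustrative only.

```python
def all_subclasses(cls):
    """Yield every direct and indirect subclass of cls exactly once."""
    seen = set()
    stack = list(cls.__subclasses__())
    while stack:
        sub = stack.pop()
        if sub in seen:
            continue
        seen.add(sub)
        yield sub
        stack.extend(sub.__subclasses__())


def collect_task_configs(base_task_cls):
    """Map each task class name to a copy of its TASK_CONFIG dict."""
    return {
        sub.__name__: dict(getattr(sub, "TASK_CONFIG", {}))
        for sub in all_subclasses(base_task_cls)
    }


# Hypothetical stand-ins for the real task classes.
class BaseTask:
    TASK_CONFIG = {}


class GrepLikeTask(BaseTask):
    TASK_CONFIG = {"filter_patterns": []}


class PlasoLikeTask(BaseTask):
    TASK_CONFIG = {"status_view": "none"}


class PlasoChildTask(PlasoLikeTask):
    pass  # inherits TASK_CONFIG from PlasoLikeTask


configs = collect_task_configs(BaseTask)
for task_name in sorted(configs):
    print(task_name, configs[task_name])
```

Unlike `__subclasses__()` alone, the recursive walk also picks up tasks that derive from other tasks.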

aarontp commented 1 year ago

Looks like an interesting script! Just in case you hadn't seen it, we have something similar in https://github.com/google/turbinia/blob/master/tools/turbinia_job_graph.py

Is this something that would be helpful to be in the API server, or is having the job graph script enough, or are there things we could add to that rather than adding it into the API server?

Another somewhat related feature request is to get this same graph for a given request after it has completed, which would require tracking the same flow so that it is easier to understand how things were processed.

Regarding getting the task config for each task given the evidence type: I think the part that is missing in order to do that is the Job -> Task mapping, which is currently done in each Job's create_tasks method, so it doesn't have a static mapping for the Task types. We could potentially add another attribute similar to evidence_input and evidence_output though, and potentially even refactor out most of the create_tasks methods altogether. That being said, it should be easy to enumerate all tasks and their task config variables if that would be useful.
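To make the idea of a static Job -> Task mapping concrete, here is a minimal sketch. The attribute name `task_types`, the generic `create_tasks` default, and all classes below are assumptions made up for illustration; none of this is existing Turbinia code.

```python
class Evidence:
    pass


class RawDisk(Evidence):
    pass


class PlasoFile(Evidence):
    pass


class Task:
    TASK_CONFIG = {}


class PlasoTask(Task):
    TASK_CONFIG = {"status_view": "none"}  # illustrative key only


class Job:
    evidence_input = []
    evidence_output = []
    task_types = []  # hypothetical new attribute: Task classes this Job creates

    def create_tasks(self, evidence):
        # Generic default: one task of each declared type per evidence item.
        # Jobs with special needs could still override this method.
        return [task_cls() for _ in evidence for task_cls in self.task_types]


class PlasoJob(Job):
    evidence_input = [RawDisk]
    evidence_output = [PlasoFile]
    task_types = [PlasoTask]


def job_task_configs(job_cls):
    """Return {task name: TASK_CONFIG} without instantiating the Job."""
    return {t.__name__: t.TASK_CONFIG for t in job_cls.task_types}


plaso_configs = job_task_configs(PlasoJob)
plaso_tasks = PlasoJob().create_tasks([RawDisk(), RawDisk()])
print(plaso_configs, len(plaso_tasks))
```

With a declarative attribute like this, the mapping can be read from the class without running any Job code.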

HolzmanoLagrene commented 1 year ago

Yes, the possibility to get a graph of the Jobs that are going to be run would indeed be very interesting to me. Maybe it'll help if I describe the intended use case:

My idea is to get the Jobs that could possibly be run based on the Evidence-Type. As Jobs trigger other Jobs based on their Output-Types, it is not always clear from the beginning what can be done in the first place. E.g. if I want to run a Grep-Job, this can only be done if a PlasoFile-Output is generated. This type, however, is only created if I run the Plaso-Job in the first place. If I know beforehand which Jobs will be run, I can provide them with the appropriate parameters to do what I want.
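The chaining described here amounts to a reachability walk over evidence types. Below is a minimal sketch with a hand-written job table; the job names, evidence names, and the `allowed` parameter are hypothetical, and the real input/output types would come from each Job's `evidence_input` and `evidence_output`.

```python
from collections import deque

# Hypothetical job table, loosely following the Plaso/Grep example above.
JOBS = {
    "PlasoJob": {"evidence_in": {"RawDisk"}, "evidence_out": {"PlasoFile"}},
    "GrepJob": {"evidence_in": {"PlasoFile"}, "evidence_out": {"FilteredTextFile"}},
    "StringsJob": {"evidence_in": {"RawDisk"}, "evidence_out": {"TextFile"}},
}


def triggered_jobs(start_evidence, jobs, allowed=None):
    """Return the jobs reachable from start_evidence, in discovery order.

    If allowed is given, only jobs in that set may run; a job whose input
    is produced only by a disallowed job is therefore never reached.
    """
    seen_evidence = {start_evidence}
    queue = deque([start_evidence])
    order = []
    while queue:
        evidence = queue.popleft()
        for name, spec in jobs.items():
            if name in order:
                continue
            if allowed is not None and name not in allowed:
                continue
            if evidence in spec["evidence_in"]:
                order.append(name)
                for out in spec["evidence_out"]:
                    if out not in seen_evidence:
                        seen_evidence.add(out)
                        queue.append(out)
    return order


everything = triggered_jobs("RawDisk", JOBS)
grep_only = triggered_jobs("RawDisk", JOBS, allowed={"GrepJob"})
print(everything)
print(grep_only)
```

Selecting only the Grep-like job yields nothing, because the job that produces its input was not selected.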

To do this, getting a graph that shows me the dependencies between Evidence-Types, Jobs and Tasks is the first step. The second step would be to get the parameters for each Task.

So in a nutshell I would love to have accessible through the API:

A graph representation of the dependencies preferably as json
The possibility to get the Task-Config for each Task
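One possible shape for such a JSON graph is a plain node/edge list. The sketch below uses only the standard library; the input layout mirrors the `result` dict from my first snippet, and the sample entry and the `nodes`/`edges` field names are invented for illustration.

```python
import json


def graph_to_json(jobs):
    """Serialize {job: {evidence_in, evidence_out, tasks}} as a node/edge list."""
    nodes = {}
    edges = []
    for job, data in jobs.items():
        nodes[job] = {"id": job, "type": "job"}
        for ev in data["evidence_in"]:
            nodes.setdefault(ev, {"id": ev, "type": "evidence"})
            edges.append({"source": ev, "target": job})
        for ev in data["evidence_out"]:
            nodes.setdefault(ev, {"id": ev, "type": "evidence"})
            edges.append({"source": job, "target": ev})
        for task, config in data["tasks"].items():
            nodes[task] = {"id": task, "type": "task", "config": config}
            edges.append({"source": job, "target": task})
    return json.dumps({"nodes": list(nodes.values()), "edges": edges}, indent=2)


# Hypothetical sample input with a single job.
SAMPLE = {
    "PlasoJob": {
        "evidence_in": ["RawDisk"],
        "evidence_out": ["PlasoFile"],
        "tasks": {"PlasoTask": {"status_view": "none"}},
    }
}

graph_json = graph_to_json(SAMPLE)
print(graph_json)
```

Keeping the task config on the task node would cover both requests in a single document.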

How should we proceed regarding both ideas?

aarontp commented 1 year ago

Would a static but regenerate-able representation of this data be OK instead of putting this into the API server? I.e., if we were to update https://github.com/google/turbinia/blob/master/tools/turbinia_job_graph.py to include .json output and the task configs, would that be good enough to meet your needs for this?

HolzmanoLagrene commented 1 year ago

It would be more than I was hoping for 🎉☺️

HolzmanoLagrene commented 9 months ago

What is the status of this? Has anyone had the time to look into it yet?

aarontp commented 8 months ago

@HolzmanoLagrene Sorry, I haven't gotten a chance to do that yet, but I'll try to carve out some time sometime soon.

google / turbinia

Get Dependency Graph and Task Config for each Task #1302
