isi-vista / vista-pegasus-wrapper

A higher-level API for ISI Pegasus, adapted to the quirks of the ISI Vista group
MIT License

Allow users to create arbitrary dependencies in workflow #94

Open joecummings opened 3 years ago

joecummings commented 3 years ago

The workflow the Pegasus Wrapper currently supports is one in which a KeyValueStore is put through a series of transforms and comes out the other end as a KeyValueStore. This makes sense, as most of the NLP work we do involves pipeline models like this.

However, sometimes a user may want one job to run before another even though the dependency between them does not involve mutating the KeyValueStore. This is difficult to do in the current infrastructure.

I'm not sure what the best approach is right now, but one possibility is a function that adds an edge between Job A and Job B directly. We would certainly have to do some checks to make sure the edge the user adds is valid.
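As a rough illustration of what "checks to make sure the edge is valid" could mean, here is a minimal sketch on a toy graph. The `WorkflowGraph` class and its methods are hypothetical and are not part of the wrapper's API; the checks shown (both jobs exist, no self-dependency, no cycle) are assumptions about what "valid" would cover.

```python
from collections import defaultdict
from typing import Dict, Set


class WorkflowGraph:
    """Toy stand-in for a workflow DAG; not the wrapper's real class."""

    def __init__(self) -> None:
        self._jobs: Set[str] = set()
        # Maps each job to the jobs that must wait for it.
        self._children: Dict[str, Set[str]] = defaultdict(set)

    def add_job(self, name: str) -> None:
        self._jobs.add(name)

    def _reaches(self, start: str, target: str) -> bool:
        # Iterative depth-first search along existing edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self._children[node])
        return False

    def add_edge(self, parent: str, child: str) -> None:
        """Require `parent` to finish before `child` starts."""
        for name in (parent, child):
            if name not in self._jobs:
                raise KeyError(f"unknown job: {name}")
        if parent == child:
            raise ValueError("a job cannot depend on itself")
        # If child already leads to parent, the new edge would close a cycle.
        if self._reaches(child, parent):
            raise ValueError(f"edge {parent} -> {child} would create a cycle")
        self._children[parent].add(child)

    def children_of(self, name: str) -> Set[str]:
        return set(self._children[name])


graph = WorkflowGraph()
for job in ("A", "B", "C", "D", "Y"):
    graph.add_job(job)
for parent, child in (("A", "B"), ("B", "C"), ("C", "D")):
    graph.add_edge(parent, child)
graph.add_edge("Y", "C")  # the extra edge that bypasses the KeyValueStore chain
```

With these checks, a request such as `graph.add_edge("D", "A")` is rejected with a `ValueError` because it would make the pipeline circular.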

spigo900 commented 3 years ago

Is this not possible with run_python_on_parameters(depends_on=[...])? It sounds like it might not be, but I'd like to clarify what we want here.

joecummings commented 3 years ago

Okay, let's see if I can explain this better.

Imagine we have Workflow DAG X that resembles the typical ML pipeline, where each job is transforming a corpus (adding a theory or two based on certain computations):

Job A ------ Job B ------ Job C ------ Job D

Now we have a slightly more complicated Workflow DAG X'.

Job Y-------------------
                         \
Job A ------ Job B ------ Job C ------ Job D

Job C depends on Job Y having run first, and yes, depends_on is theoretically a solution. However, in practice, in order to use depends_on, the user would have to pass the DependencyNode generated by Job Y through the entire workflow to wherever Job C is instantiated. As "proper" Python developers, we don't tend to write workflows in which all our jobs are created in the same file. Ideally, I would love a way to identify the nodes before the DAG is generated and then do something like workflow.add_edge(Job Y, Job C).
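One way to picture the workflow.add_edge(Job Y, Job C) idea is to let jobs be identified by name and to resolve edges only when the DAG is generated. The sketch below is purely illustrative: `DeferredWorkflow`, `add_job`, `add_edge`, `build`, and `build_corpus_pipeline` are hypothetical names, not functions the wrapper provides.

```python
from typing import Dict, List, Set, Tuple


class DeferredWorkflow:
    """Toy builder: jobs register by name, edges resolve at build time."""

    def __init__(self) -> None:
        self._jobs: Set[str] = set()
        self._pending_edges: List[Tuple[str, str]] = []

    def add_job(self, name: str) -> None:
        self._jobs.add(name)

    def add_edge(self, parent: str, child: str) -> None:
        # Only record the request; the jobs may not be registered yet.
        self._pending_edges.append((parent, child))

    def build(self) -> Dict[str, Set[str]]:
        """Return a mapping from each job to the jobs it depends on."""
        parents: Dict[str, Set[str]] = {name: set() for name in self._jobs}
        for parent, child in self._pending_edges:
            missing = {parent, child} - self._jobs
            if missing:
                raise KeyError(f"edge refers to unknown jobs: {sorted(missing)}")
            parents[child].add(parent)
        return parents


def build_corpus_pipeline(workflow: DeferredWorkflow) -> None:
    # Imagine this lives in another module and knows nothing about Job Y.
    previous = None
    for name in ("Job A", "Job B", "Job C", "Job D"):
        workflow.add_job(name)
        if previous is not None:
            workflow.add_edge(previous, name)
        previous = name


workflow = DeferredWorkflow()
workflow.add_edge("Job Y", "Job C")  # declared before either job exists
workflow.add_job("Job Y")
build_corpus_pipeline(workflow)
dependencies = workflow.build()
```

Here `dependencies["Job C"]` ends up containing both Job B and Job Y, and no DependencyNode had to be passed into `build_corpus_pipeline`. The sketch only checks that both endpoints exist at build time; cycle detection would still be needed.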

For the record, I am open to other solutions if you have thoughts.

spigo900 commented 3 years ago

So if I am understanding you, the problem is about how we specify the dependencies. Your Job C is created in some other file (which might not be part of your project). It might not have a depends_on argument and it might be awkward to add one. Maybe the function that creates C also creates jobs A and B. Passing Y into this for C to depend on it would be messy and unintuitive (the easiest solution is to add a c_depends_on but then why not have those for A and B?). It might also be "bad style" since maybe this is an edge case that applies to just one workflow. So we probably want to do something else. Is that right?
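To make the awkwardness concrete, here is a toy version of the c_depends_on workaround. `run_job` is a hypothetical stand-in for a job-creating call that accepts a depends_on argument; it is not the wrapper's real function, and job handles are just strings here.

```python
from typing import Dict, List, Sequence

# Records which jobs each job waits on (toy bookkeeping only).
dependencies: Dict[str, List[str]] = {}


def run_job(name: str, depends_on: Sequence[str] = ()) -> str:
    """Toy stand-in for a job-creating call; returns a handle (the name)."""
    dependencies[name] = list(depends_on)
    return name


def build_pipeline(c_depends_on: Sequence[str] = ()) -> str:
    # The extra parameter exists only so one caller can reach Job C.
    a = run_job("Job A")
    b = run_job("Job B", depends_on=[a])
    c = run_job("Job C", depends_on=[b, *c_depends_on])
    return run_job("Job D", depends_on=[c])


y = run_job("Job Y")
final_job = build_pipeline(c_depends_on=[y])
```

This works, but the signature of `build_pipeline` now encodes one workflow's edge case, and nothing explains why Job C gets a parameter while Jobs A and B do not.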

I'm assuming here you do already have a reference to Jobs C and Y. Or is there also a problem of identifying Jobs C and Y (because say the only job you directly set up in your workflow is Job D)?

joecummings commented 3 years ago

I feel like that is an apt description. I think this does make me want to look further into how you (+ Jacob + Deniz + whoever else) are using Pegasus, because to me this doesn't seem like that unusual of a workflow, but perhaps it is.

I don't have trouble identifying jobs atm.