globus-gladier / gladier

An SDK for rapidly developing Globus Flows while leveraging Globus Compute
Apache License 2.0

Generate JSON Schema for Flows #107

Open NickolausDS opened 3 years ago

NickolausDS commented 3 years ago

Flows support JSON Schema for validating inputs; see the deploy_flow reference:

https://globus-automate-client.readthedocs.io/en/latest/python_sdk_reference.html#globus_automate_client.flows_client.FlowsClient.deploy_flow

I think input mismatches (not specifying a critical piece of input required for a flow) are the main cause of the dreaded 'Flow Failed' errors. These errors are especially painful because they carry almost no context about why a given flow failed. Adding a JSON Schema that detects any missing input should fix these.

NickolausDS commented 3 years ago

Initial investigation shows the problem is fairly complex, and probably better solved in Automate than in Gladier. A flow can fail if any input is left out. For example, if a state requires foo with 'foo.$': '$.input.foo' and the user does not specify foo, the result is a FlowFailed. Adding a schema does fix this, but the schema has to be crafted very carefully to account for every place $.input values are referenced. Since $.input can appear anywhere in the flow definition, the whole definition needs to be traversed to find every possible $.input.x value so it can be added to the schema. This is true for funcX functions and for every other conceivable type of action provider input.
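The traversal described above could be sketched roughly as follows. This is only an illustration, not Gladier code: find_input_keys and the regular expression are hypothetical names, and the sketch assumes references always look like a plain '$.input.<name>' string (more elaborate JSONPath expressions are not handled).

```python
import re

# Matches "$.input.<name>" and captures <name>. Assumption: only simple
# dotted references are used; other JSONPath forms are ignored.
INPUT_REF = re.compile(r"^\$\.input\.([A-Za-z_][A-Za-z0-9_]*)")


def find_input_keys(node):
    """Recursively collect every key referenced as $.input.<key>."""
    found = set()
    if isinstance(node, dict):
        for value in node.values():
            found |= find_input_keys(value)
    elif isinstance(node, list):
        for item in node:
            found |= find_input_keys(item)
    elif isinstance(node, str):
        match = INPUT_REF.match(node)
        if match:
            found.add(match.group(1))
    return found


# A trimmed-down flow definition, used only for illustration.
flow_definition = {
    "StartAt": "HelloWorld",
    "States": {
        "HelloWorld": {
            "Type": "Action",
            "Parameters": {"tasks": [{"payload": {"foo.$": "$.input.foo"}}]},
            "End": True,
        }
    },
}

print(sorted(find_input_keys(flow_definition)))
```

References that do not start with $.input (such as the output of an earlier state) are skipped by the pattern, so they never end up in the collected set.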

A secondary problem with gathering all the $.input values is determining what type each should be when the JSON Schema is generated. Since $.input values are not limited to funcX flow states, I think the only way to determine input types with any accuracy is to ping the /introspect endpoint on each state's Action Provider interface. Type information is listed there, and would allow us to properly build a JSON Schema and reject incorrect types on $.input.

A third general problem with building JSON Schema is that a flow can reference much more than $.input. The most common case is when flow step B depends on flow step A. This looks like B.InputPath: $.A.details.result. Since the input to B depends on the output of state A, which only exists once A has run, I don't think JSON Schema will help us here.

NickolausDS commented 3 years ago

There is a middle ground here that I think we should take. Since most of our input problems (and failing flows) come from not providing certain $.input.x variables, it would be best to simply ensure the variables are present and not worry about checking their types (avoiding the need to ping /introspect for APs). Even though that won't solve all input problems, I think it gets us 80% of the way there without a whole bunch of extra work.
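A presence-only schema of that kind might look like the sketch below. The function names are hypothetical, and it assumes flow input is a document with a top-level "input" object (which is what a '$.input.x' reference implies). The missing_inputs helper is only a stand-in for a real JSON Schema validator so the example runs with the standard library alone.

```python
def build_presence_schema(input_keys):
    """Build a JSON Schema that requires each key under "input" but accepts any type."""
    keys = sorted(input_keys)
    return {
        "type": "object",
        "required": ["input"],
        "properties": {
            "input": {
                "type": "object",
                # An empty subschema ({}) accepts a value of any type.
                "properties": {key: {} for key in keys},
                "required": keys,
            }
        },
    }


def missing_inputs(schema, flow_input):
    """Stand-in for a real validator: list required keys absent from flow_input["input"]."""
    required = schema["properties"]["input"]["required"]
    provided = flow_input.get("input", {})
    return [key for key in required if key not in provided]


schema = build_presence_schema({"foo", "bar"})
print(missing_inputs(schema, {"input": {"foo": 1}}))
```

Leaving each property's subschema empty is what keeps this to a presence check: any value passes, so no type information from /introspect is needed.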

For funcX functions, we can introspect the function signatures fairly easily and provide more specific information about what the values should be. In practice, that looks like the following:

def hello_world(foo, bar):
    ...

Which builds the flow definition like so:

{'Comment': 'Flow with states: HelloWorld',
 'StartAt': 'HelloWorld',
 'States': {'HelloWorld': {'ActionScope': 'https://auth.globus.org/scopes/b3db7e59-a6f1-4947-95c2-59d6b7a70f8c/action_all',
                           'ActionUrl': 'https://automate.funcx.org',
                           'Comment': 'Hello World',
                           'End': True,
                           'ExceptionOnActionFailure': False,
                           'Parameters': {'tasks': [{'endpoint.$': '$.input.funcx_endpoint_compute',
                                                     'function.$': '$.input.hello_world_funcx_id',
                                                     'payload': {'bar.$': '$.input.bar',
                                                                 'foo.$': '$.input.foo'}}]},
                           'ResultPath': '$.HelloWorld',
                           'Type': 'Action',
                           'WaitTime': 300}}}

Gladier would additionally generate the JSON Schema that ensures both $.input.foo and $.input.bar are present when the user starts the flow.
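One way the signature introspection could feed that schema is sketched below, using only the standard inspect module. schema_from_signature is a hypothetical helper, not an existing Gladier function, and the sketch assumes parameters without defaults are the ones that must be present. The $.input.funcx_endpoint_compute and $.input.hello_world_funcx_id values are left out here because they do not come from the function signature.

```python
import inspect
import json


def hello_world(foo, bar):
    ...


def schema_from_signature(func):
    """Derive a presence-only JSON Schema for $.input from a function signature."""
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        # *args and **kwargs do not map to a single named input; skip them.
        if param.kind in (inspect.Parameter.VAR_POSITIONAL,
                          inspect.Parameter.VAR_KEYWORD):
            continue
        properties[name] = {}  # empty subschema: any type is accepted
        # Assumption: only parameters without a default are required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "object",
        "required": ["input"],
        "properties": {
            "input": {
                "type": "object",
                "properties": properties,
                "required": required,
            }
        },
    }


print(json.dumps(schema_from_signature(hello_world), indent=2))
```

For hello_world this marks both foo and bar as required under "input". Type annotations on the function, where present, could later be mapped to JSON Schema types, but that is not attempted here.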