globus / globus-compute

Globus Compute: High Performance Function Serving for Science
https://www.globus.org/compute
Apache License 2.0

funcX task life-cycle #510

Open yadudoc opened 3 years ago

yadudoc commented 3 years ago

Is your feature request related to a problem? Please describe.

This is more a technical problem that bothers us as developers than one that affects users, but a well-defined task life-cycle would also help explain to users what is going on with their functions.

Describe the solution you'd like

Create a state-flow diagram that captures the states a function goes through on the web-service, forwarder, and endpoint. As a bonus, it would be good to extend this diagram to support retries.

Additional context

This is necessary for #509

yongyanrao commented 3 years ago

| Scenario | Task status |
|---|---|
| Queried a completed task | {'pending': False, 'status': 'success', 'result': 'Hello World!', 'completion_t': '1623881816.333504'} |
| Queried a task in progress | {'pending': True, 'status': 'running'} |
| Queried a task that has been submitted to an offline endpoint | {'pending': True, 'status': 'waiting-for-ep'} |
| Queried a task that raised an exception (e.g., division by zero) | {'pending': False, 'status': 'failed', 'exception': <parsl.app.errors.RemoteExceptionWrapper object at 0x7f7d28c0ea50>, 'completion_t': '1623882199.3872406'} |
| Queried a task that raised an exception (e.g., unsupported import) | {'pending': False, 'status': 'failed', 'exception': <parsl.app.errors.RemoteExceptionWrapper object at 0x7f619936ead0>, 'completion_t': '1623882314.4383392'} |

Note:

  1. Exceptions can be inspected through fxc.get_task(res)['exception'].
  2. Currently, a task status query can distinguish a task in progress ('running') from a task sent to an offline endpoint ('waiting-for-ep'). Until the web-service/forwarder receives either a heartbeat from the endpoint or the function result, the status is 'waiting-for-ep', so a task submitted to an offline endpoint stays in that status. Once the function result is received, the status becomes 'success'; if a heartbeat is received instead, it becomes 'running'.
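
As a rough illustration of how a client could act on these status dicts, here is a minimal sketch. The helper name summarize_task is hypothetical and not part of funcX; it only assumes the dict shapes shown in the table above, such as what fxc.get_task(res) returned in these tests.

```python
def summarize_task(status: dict) -> str:
    """Turn a task-status dict (shaped like the examples above) into a one-line summary.

    Hypothetical helper for illustration only; not a funcX API.
    """
    if status.get("pending"):
        if status.get("status") == "waiting-for-ep":
            # No heartbeat and no result has reached the web-service/forwarder yet.
            return "pending: no heartbeat or result from the endpoint yet"
        return f"pending: {status.get('status')}"
    if status.get("status") == "success":
        return f"done: result={status.get('result')!r}"
    # Any other non-pending status is treated as a failure here.
    return f"failed: {status.get('exception')!r}"


print(summarize_task({"pending": True, "status": "waiting-for-ep"}))
print(summarize_task({"pending": False, "status": "success", "result": "Hello World!"}))
```

In real use the 'exception' value would be the wrapper object shown in the table rather than a plain string.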
yongyanrao commented 3 years ago

Task state diagram: https://miro.com/app/board/o9J_l-58NQg=/

There are two kinds of states:

  1. Terminal states: success or failed. For terminal states, we infer the pending status to be False.
    • success: we need the function result and completion time
    • failed: we need the failure exception and completion time
  2. Intermediate states: For intermediate states, we infer the pending status to be True.
    • submitted
    • waiting-for-ep
    • dispatched-to-ep
    • running
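
The two kinds of states above could be sketched as an enum, with the pending flag derived from the state rather than stored separately. The names TaskState, is_pending, and required_fields are assumptions made for illustration; the field names come from the status dicts earlier in this thread.

```python
from enum import Enum


class TaskState(Enum):
    # Intermediate states (pending is inferred to be True)
    SUBMITTED = "submitted"
    WAITING_FOR_EP = "waiting-for-ep"
    DISPATCHED_TO_EP = "dispatched-to-ep"
    RUNNING = "running"
    # Terminal states (pending is inferred to be False)
    SUCCESS = "success"
    FAILED = "failed"


TERMINAL_STATES = frozenset({TaskState.SUCCESS, TaskState.FAILED})


def is_pending(state: TaskState) -> bool:
    """A task is pending exactly when it is not in a terminal state."""
    return state not in TERMINAL_STATES


def required_fields(state: TaskState) -> tuple:
    """Extra fields a status record needs for the given state."""
    if state is TaskState.SUCCESS:
        return ("result", "completion_t")
    if state is TaskState.FAILED:
        return ("exception", "completion_t")
    return ()


print(is_pending(TaskState("waiting-for-ep")))
print(required_fields(TaskState.FAILED))
```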
yongyanrao commented 3 years ago

Items to discuss:

  1. Whether we want to keep submitted and waiting-for-ep as two separate states. After receiving a function submission, the web-service/forwarder immediately tries to connect to the given endpoint. The connection attempt has two possible outcomes: a connection failure (whatever the network issue, we treat it as the endpoint being offline) or the task being dispatched to the endpoint.
yongyanrao commented 3 years ago

The endpoint reports endpoint status and task status per executor.

  1. This is a mismatch: each report uses the endpoint id as its identifier, but all the information is collected at the executor level. This becomes an issue when an endpoint has multiple executors.
  2. Task status: from the endpoint's point of view, task statuses include WAITING_FOR_NODES, WAITING_FOR_LAUNCH, RUNNING, SUCCESS, and FAILED. From the forwarder's point of view, task statuses include RECEIVED, WAITING_FOR_EP, WAITING_FOR_NODES, WAITING_FOR_LAUNCH, RUNNING, SUCCESS, and FAILED.
yongyanrao commented 3 years ago

Task status definitions:

  • RECEIVED: the web-service has received the task submission.
  • DISPATCHED_TO_EP: the forwarder has dispatched the task to the endpoint, but the endpoint has not yet acknowledged it.
  • WAITING_FOR_NODES: the endpoint is waiting to send the task to funcx-manager.
  • WAITING_FOR_LAUNCH: the endpoint is waiting for the task to be executed by funcx-worker.
  • RUNNING: the task is being executed by funcx-worker.
  • SUCCESS: execution completed successfully and the result was returned.
  • FAILED: execution failed and an exception was returned.
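
The definitions above suggest an ordering, which could be checked with something like the sketch below. The rules encoded here are assumptions, not the agreed design: a task only moves forward through the intermediate states (possibly skipping some, as when a result arrives before any heartbeat), it may reach SUCCESS or FAILED from any intermediate state, and terminal states are final. Retries are not modeled, and can_transition is a hypothetical name.

```python
# Assumed forward order of the intermediate states, taken from the definitions above.
ORDER = [
    "RECEIVED",
    "DISPATCHED_TO_EP",
    "WAITING_FOR_NODES",
    "WAITING_FOR_LAUNCH",
    "RUNNING",
]
TERMINAL = {"SUCCESS", "FAILED"}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` is allowed under the assumed rules.

    Raises ValueError if either name is not a known intermediate or terminal state.
    """
    if current in TERMINAL:
        # Terminal states are final (no retries in this sketch).
        return False
    if new in TERMINAL:
        # Validate `current` is a known state, then allow the jump to a terminal state.
        ORDER.index(current)
        return True
    # Intermediate states may only advance, never go back or stay in place.
    return ORDER.index(new) > ORDER.index(current)


print(can_transition("RECEIVED", "DISPATCHED_TO_EP"))
print(can_transition("RUNNING", "WAITING_FOR_NODES"))
```

If retries are added to the diagram later, the "terminal states are final" rule is the part that would need to change.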