serverlessworkflow / specification

Contains the official specification for the Serverless Workflow Domain Specific Language. It provides detailed guidelines and standards for defining, executing, and managing workflows in serverless environments, ensuring consistency and interoperability across implementations.
http://serverlessworkflow.io
Apache License 2.0
732 stars 146 forks source link

Clarification on supporting multiple actions and actionMode in Operation State #398

Closed yzhao244 closed 2 years ago

yzhao244 commented 3 years ago

I have this issue opened and would like to discuss the spec regarding the clarification on supporting multiple actions and actionMode in Operation state.

If my understanding is correct, in the specifications, the Operation State can contain multiple actions, and each action contains one function or one subflow. However, this approach cannot flexibly specify onErrors definition such as a retry policy for each action.

Furthermore, I looked a closed issue which has similar topic discussed before 140 . @tsurdilo @manuelstein @cathyhongzhang Thanks for the input of that issue, it was helpful for me to clarify some things. :)

However, I am still trying to clarify what are the use cases of using multiple actions in one Operation State vs simply using multiple Operation States which gain the flexibility of specifying onError definition for each function. When I looked at AWS Step Function which has similar concept called "Task" (I think it is quite equivalent to Operation) can only contain one function where user can specify retry definition for each function.

Therefore, it may be confusing to new users who are not fully aware of pros and cons about using multiple actions which lose the flexibility of specifying retry policy for each action in Operation State.

It could be just me having this question lol.. I would appreciate if you could please help me out. :)

yzhao244 commented 3 years ago

@tsurdilo I opened a new issue here. Sorry about the other one. :)

cdavernas commented 3 years ago

However, I am still trying to clarify what are the use cases of using multiple actions in one Operation State vs simply using multiple Operation States which gain the flexibility of specifying onError definition for each function.

@yzhao244 In my personal experience with an operational, proprietary SW runtime for a client, the possibility to use many successive operation vs multiple states is for organizational sake. As a matter of fact, using a single state per action can become quickly tedious, difficult to maintain, and might not correspond to the workflow's audience understanding of the flow.

An example is that for that client, we have to execute several REST functions on a 3rd party, legacy app, in order to retrieve a document: based on user input, find the id of the document thanks to its name, review it programatically, then publish it. If we were to split those actions into as many states, instead of having one state 'PublishDocument', we would have something like 'GetDocumentIdFromName', 'ReviewDocument' and 'PublishDocument' states, which was far less readable and "logic" to my client.

In other cases, you do not have the choice: for some complex operations, you WILL need to use stateDataFilter to do complex processing on the data returned by an action, and WILL therefore need several states instead of one. Still, even then, I'd rather use a subflow "grouping" logically those states rather than having this huge god workflow, with numerous states that do not represent the workflow as it is understood by humans.

I hope this at least partially answers your question ;)

tsurdilo commented 3 years ago

@yzhao244

to add to @cdavernas answer,

this is indeed a workflow definition design choice based on your requirements (whether to have a single function invocation in your actions or multiple).

A State (or Task in ASL as you mentioned) for us represents a unit of work. An action definition can be part of this unit of work, but in some cases you can have multiple actions represent a single one. Error handling in SW currently is based on errors raised during unit of work / state execution, and so are retry definitions. So yes, the way of defining a retry definition for a single action (function invocation) is to have a single action associated with your action definition.

Do you have some use cases where you think this is better handled in a different way(s)?

yzhao244 commented 3 years ago

@tsurdilo @cdavernas Thanks for the reply, guys. As discussed in the weekly meeting, our Serverless Workflow spec could be considered as a superset of other ASL in the industry which only supports a single function in a state. Therefore, it is just an end-user's design choice for choosing a single function or multiple functions in a state.

The main reason for me to open the issue is that to me, serverless workflow is something that brings stateless functions(e.x. REST APIs) into stateful functions where states are persisted based on state definitions. Eventually, building many stateful functions together into a workflow for representing a piece of business logic. Therefore, supporting multiple actions in one single state loses that flexibility of persisting state for each stateless function and allowing retry for each function.

However, this could be just me thinking of this way. lol. This could be totally up to the end-user's design whether they do need all states for all functions or multiple functions represent in single state. :)

tsurdilo commented 3 years ago

Serverless Workflow spec could be considered as a superset of other ASL in the industry...

It's not just invocation of functions, features like event handling, error handling, timeouts, callbacks, integration with openapi, grpc, cloudevents, etc etc are all features where Serverless Workflow DSL is already a super-set (feature-wise) of pretty much any DSL-based workflow language.

If you look at what's happening with workflow DSLs right now I personally put my money down that a ton of companies that are developing their own proprietary DSLs are basing them on Serverless Workflow ...and that's for a reason.

tsurdilo commented 3 years ago

@yzhao244 @cdavernas @ricardozanini et all - to give more info regarding runtime impls and this question I wanted to explain why retries are on the workflow state level and not on individual action level:

Retries should be tied to business logic, meaning that forcing retries on each action individually is not always what you want to do. In the cases where you want to do it, ok, put a single action in operation state and retries will be specific to that.

In most cases where you have multiple actions in actions block in a state they are an "execution block" or a "group of tasks" that together define some sort of thing user wants to perform. In that case it makes much more sense to retry on all of them together.

To show this with an example lets say that we have operation state "OP" which defines retry definition "RET" and state timeout definition "TIME" The state is responsible to process a file and to process is we have to: 1) Download the file 2) Process the file 3) Store the File in S3 for example

In this case you can see the state as a function which performs 3 tasks. It makes no sense to retry tasks individually as that is a waste of time, the retry should happen on the whole sequence of tasks. So as pseudo-code you could see it as for example:

State.retry(RET, TIME, () -> OP())

meaning you retry the state, with the reference retry definition info, and OP() is the "function" which is made of the the 3 tasks we mentioned.

If we look at this from the "bpm" point of view, all actions of a workflow state would be in a "subprocess" and the retries (meaning the boundary error event going to a timer event and then transitionng back to for example a gateway that goes back to the subprocess) are on that subprocess boundary and not individual task boundaries.

Hope this helps.

cdavernas commented 3 years ago

@tsurdilo Very good example, thanks man! However, in some case atomicity is preferable IMHO.

Let's take your example: I download the file, process it with, say, ffmpeg before uploading it. Imagine that file is a 5go ripped DVD that took already forever to download. Let's not even speak of the FFMPEG process... Now, last action fails to upload because S3 is temporarily unavailable. With state level retry, I'll have to replay the 2 first states for no reason, using btw extreme amount of resources: not too cloud (or serverless) friendly. You could as well imagine that the file is downloaded in a temp file, or stored into something like SVN, which, depending on which commands you used, might throw an exception (file already exists), thus making state level retries utterly useless.

In other words, I've nothing against state level retries, but is IMO a convenient feature at best, as atomicity of actions and retries is, again IMHO, a best practice, and one of the great differentiator of our DSL.

tsurdilo commented 3 years ago

@cdavernas

Let's take your example: I download the file, process it with, say, ffmpeg before uploading it. Imagine that file is a 5go ripped DVD that took already forever to download. Let's not even speak of the FFMPEG process... Now, last action fails to upload because S3 is temporarily unavailable. With state level retry, I'll have to replay the 2 first states for no reason, using btw extreme amount of resources: not too cloud (or serverless) friendly. You could as well imagine that the file is downloaded in a temp file, or stored into something like SVN, which, depending on which commands you used, might throw an exception (file already exists), thus making state level retries utterly useless.

Good point, my take on it is "depends on the runtime" :) When you execute a batch of actions in an actions array your runtime as it executes each action anyways, can record which ones have been successfully completed. On a replay then you don't have to replay each one. Each action has a result and if an action produces a result (does not throw an error that is then handled in dsl and can cause a defined retry) you can record it and on replay not execute it again but use that already recorded result.

One thing that we could do for actions imo is to create action groups, for example:

  "actions": {
     "groupA": [ ....],
     "groupB": [ ....]
  }

and on exception which defines a retry your runtime would know which group raised it and retry that group only. This could eliminate the need to define multiple workflow states that was mentioned in this thread. Do you think that would help?

(this is anyways something I wanted to add for a while as it would allow us to reuse action "groups" in multiple states - define once and reference in multiple states).

ricardozanini commented 3 years ago

Nice idea about grouping the actions. :+1: Runtimes could even save the entire group state.

cdavernas commented 3 years ago

@tsurdilo Great idea, but I'm personally not a huge fan of it. It would solve the problem raised by @yzhao244, but does it in a somewhat tricky way IMHO.

The way I see it, we should be able to reference timeout and error handling strategies in both states and actions. Doing so would allow the author to cherry pick the actions that require a specific retry before eventually bubbling the error up after unsuccessful handling.

cdavernas commented 3 years ago

Another idea could be to provide actions with a retry: true/false property. In case it is set to false, if the state attempts a retry, the action would be ignored.

tsurdilo commented 3 years ago

I think i understand. Let talk via example json/yaml orherwise i think we could end up in long discussions that bring no value as we miss individual points completely. I think lets start from the root which imo is that retries are bound to error types currently. imo thats where we should look into making the first change and go from there, meaning retries should be tied to action(s) and actions should apecify list of errors for which they should not be retried for.

On Sat, Jul 3, 2021 at 7:21 AM cdavernas @.***> wrote:

Another idea could be to provide actions with a retry: true/false property. In case it is set to false, if the state attempts a retry, the action would be ignored.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/serverlessworkflow/specification/issues/398#issuecomment-873392367, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA5E7WV4L7WS266QGRF5QLTV3XE3ANCNFSM46BZCKOQ .

yzhao244 commented 3 years ago

@tsurdilo @cdavernas @ricardozanini @manuelstein Hi guys, I think it is a good idea to reference timeout and error handling/retry strategies in both states and actions. Based on my personal experiences with end-users, the biggest selling point of Cloud serverless workflow services(such as AWS step function, Huawei Function Flow and so on) is the state management plus error/retry handlings which end-user would buy in. Without state management + error/retry handling, end-user would usually just choose lambda service itself because one lambda can directly calls another lambda by using lambda destination feature as an example.. lol, it will be cheaper for end-user. Therefore, I think it is good to support distinguishing error handling with which function it has occurred.

I created two JSON examples to our specs below which I hope to trigger more discussions for referencing timeout, error handling/retry stretegies lol..

Example 1

"actions": [
            {
              "functionRef": "StorePatient"
              "actionRetry: 
                   "errors":
                        ["StorePatient.InternalError.Busy"]
                   "delay": 10
                   "maxAttempts:": 2
            },
            {
              "functionRef": "AssignDoctor"
            },
          ],
        "onErrors": [
        {
          "error": "StorePatient.InternalError.Busy",
          "code": "503",
          "end": true
        }
      ] 

Example 2

"actions": {
              "GroupA":[
                   "functionRef": "StorePatient",
                   "actionOnErrors": [
                    {
                      "error": "ServiceNotAvailable",
                      "retryRef": "StorePatientRetryStrategy",
                      "code": "503", 
                    }
              ],
              "GroupB":[
                   "functionRef": "AssignDoctor"
              ],
        "retries": [
        {
          "name": "StorePatientRetryStrategy",
          "delay": "PT3S",
          "maxAttempts": 10
        }
      ]
tsurdilo commented 3 years ago

working on error handling part for timeouts i am not yet sure

tsurdilo commented 3 years ago

rly like the use of action groups in example!!

tsurdilo commented 3 years ago

https://github.com/serverlessworkflow/specification/pull/435 solves imo most of these mentioned issues. we will look into adding groups in our next spec release (after 0.7).

yzhao244 commented 3 years ago

@tsurdilo Thanks for adding ActionRetries in 0.7 :) . For the idea of Actions group, should I raise separated issue for discussion?

tsurdilo commented 3 years ago

@yzhao244 sure. or we can finally start using our discussions board https://github.com/serverlessworkflow/specification/discussions :)

tsurdilo commented 2 years ago

@yzhao244 can this issue be closed?

yzhao244 commented 2 years ago

@tsurdilo Yes, I am going to close this one. Thx :)