serverlessworkflow / specification

Contains the official specification for the Serverless Workflow Domain Specific Language. It provides detailed guidelines and standards for defining, executing, and managing workflows in serverless environments, ensuring consistency and interoperability across implementations.
http://serverlessworkflow.io
Apache License 2.0

Event data filter should have access to all event attributes #284

Closed. jorgenj closed this issue 3 years ago.

jorgenj commented 3 years ago

What would you like to be added:

Update event-related features to allow access to the full cloud-event attributes, not just the payload.

Why is this needed:

Currently, the spec says that a workflow can only access the data payload of the cloud events it receives. This means a workflow can't access id, timestamp, data_base64, etc. The argument for this was that a runtime might receive a cloud event via HTTP, in which case most of the attributes are presented as headers (i.e., not all protocols deliver the full cloud event as JSON). As far as I can tell, regardless of the wire format, any cloud event received can be translated to a JSON object. That JSON representation is what the runtime should expose to a workflow (regardless of the wire format it was received in).
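For illustration, here is a minimal sketch of one and the same cloud event in two forms: received over HTTP in binary content mode, where the context attributes travel as ce-* headers, and the CloudEvents JSON representation a runtime could expose to the workflow. All attribute values below are hypothetical.

```
POST /events HTTP/1.1
Content-Type: application/json
ce-specversion: 1.0
ce-id: a1b2c3
ce-source: /orders
ce-type: com.example.order.placed

{"itemId": "42", "quantity": 1}
```

could be presented to the workflow as:

```json
{
  "specversion": "1.0",
  "id": "a1b2c3",
  "source": "/orders",
  "type": "com.example.order.placed",
  "datacontenttype": "application/json",
  "data": { "itemId": "42", "quantity": 1 }
}
```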

tsurdilo commented 3 years ago

I will try to get the CE guys to give us a clarification on this because I am not 100% sure that converting the entire message to JSON is possible/feasible, but I might be wrong.

tsurdilo commented 3 years ago

@jorgenj From the spec perspective, though, even if this is possible I still think we can only enforce the required attributes, namely "id", "source", "specversion", and "type" (https://github.com/cloudevents/spec/blob/v1.0.1/spec.md#required-attributes). Anything beyond that would affect portability (custom context attributes/extensions).

Note that even then, since "source" and "type" are required in the event definition in the workflow, you already know those if you consume one of those events. That really leaves "id" and "specversion", which you could also control as part of the correlation definition. So my question is: what use case would require the unique event id that you don't already get from its source and type attributes?

given that "data" is optional (if event does not contain a payload) we should probably state that as well in doc that its possible

Regarding data_base64, let's ask the CE guys about that as well, as allowing it could break the workflow/state data (JSON).

jorgenj commented 3 years ago

> even if this is possible I still think we can only enforce the required attributes

That's fine; if a runtime receives a cloud event via HTTP (or some other protocol), we can't guarantee that all custom event attributes will be found, but that's a limitation of certain runtime implementations. Afaict it's perfectly valid to have 'extension' attributes among the top-level attributes, and the current spec completely precludes workflows from accessing those.

> That really leaves "id" and "specversion", which you could also control as part of the correlation definition.

Actually, this leaves "id", "specversion", "subject", "time", "dataschema", "datacontenttype", and "data_base64".

> Regarding data_base64, let's ask the CE guys about that as well, as allowing it could break the workflow/state data (JSON).

data_base64 appears to be a string containing base64-encoded data. That's perfectly fine as a string in JSON, afaict.
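For reference, the CloudEvents JSON format carries a binary payload as a base64-encoded string in the data_base64 member, so it remains an ordinary JSON string. A minimal sketch with hypothetical values ("aGVsbG8gd29ybGQ=" is the base64 encoding of "hello world"):

```json
{
  "specversion": "1.0",
  "id": "4711",
  "source": "/sensors/thermo-1",
  "type": "com.example.binary.reading",
  "datacontenttype": "application/octet-stream",
  "data_base64": "aGVsbG8gd29ybGQ="
}
```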

tsurdilo commented 3 years ago

After talking to Doug, +1 to move on this. Since we enforce the CE format anyway, he mentioned that it could always be serialized into JSON, so the event filter would be able to query/select against it... like you said @jorgenj :)

tsurdilo commented 3 years ago

I would like to ask for an example showing a use case to be added to the examples page with this change, please. Even though a CE can be converted to JSON, I think that in situations where you are consuming a ton of events and need highly scalable performance, converting thousands of events to JSON before you can actually do anything is a huge overhead... so if we add this, we need a good example to justify it :)

jorgenj commented 3 years ago

@tsurdilo

I don't understand how this change has any impact on scale. With the current state of the spec, the runtime is already deserializing the 'payload' to JSON; the 'overhead' of adding a few more fields to that JSON should be inconsequential.

I would invert the requirement and say that we should have had a very strong justification for why the current spec only allows access to the payload.

tsurdilo commented 3 years ago

@jorgenj the event data (payload) contains the message actually intended for the business logic expressed in the workflow. Context attributes are typically used for identification of such a message (and correlation). I don't see a use case where you would need them when implementing business requirements (and if you do, I think that is probably an anti-pattern, meaning all data that the event wants to express to the consuming systems should be inside the payload). Please let me know.

jorgenj commented 3 years ago

In the weekly community discussion, a suggestion was made to keep the defaults acting on the payload portion: workflows would gain the ability to access other fields on the event, but by default just the payload would be extracted.
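A minimal sketch of how that could look in an eventDataFilter, assuming the spec's jq-style ${ ... } expressions; the state-data field names here are hypothetical. The default would behave as today:

```json
"eventDataFilter": {
  "data": "${ .data }",
  "toStateData": "${ .order }"
}
```

while a workflow that needs envelope attributes could opt in explicitly:

```json
"eventDataFilter": {
  "data": "${ { eventId: .id, item: .data.itemId } }",
  "toStateData": "${ .order }"
}
```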

jorgenj commented 3 years ago

@tsurdilo

Since you asked for an example, here's the most obvious case where someone would need access to the envelope of the event. In my humble opinion, it's an 'anti-pattern' to suggest that users should be forced to duplicate this inside the payload for such a common use case.

> id
> Type: String
> Description: Identifies the event. Producers MUST ensure that source + id is unique for each distinct event. If a duplicate event is re-sent (e.g. due to a network error) it MAY have the same id. Consumers MAY assume that Events with identical source and id are duplicates.

Consider any use case where a user wants to create a workflow that talks to a downstream service for ordering hamburgers. The workflow is initiated by the receipt of a 'giveMeHamburgers' event, and they want to ensure that for every unique event one and only one hamburger is created in the downstream service.

A reasonable solution would be to use the unique id of the event for making idempotent requests to the downstream service.
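As a sketch, an action could pass the event id straight through as an idempotency key. The function and argument names are hypothetical, and this assumes the envelope attributes are visible to expressions as proposed above (here, under a hypothetical hamburgerEvent key in the state data):

```json
{
  "name": "OrderHamburger",
  "functionRef": {
    "refName": "createHamburgerOrder",
    "arguments": {
      "idempotencyKey": "${ .hamburgerEvent.id }",
      "quantity": "${ .hamburgerEvent.data.quantity }"
    }
  }
}
```

The downstream service can then deduplicate retries on idempotencyKey, so a re-delivered event never produces a second hamburger.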

tsurdilo commented 3 years ago

@jorgenj thanks for the explanation. If I look at this from the user perspective, it would be much easier for me to define uniqueness at the event definition level, for example:

"events": [
        {
            "name": "BloodPressure",
            "type": "patient.vitals.BloodPressure",
            "source": "patients/vitals",
           "unique": true,
            "correlation": [
              { 
                "contextAttributeName": "patientId"
              } 
            ]
        }
]

than having to write expressions to enforce it manually. wdyt?

jorgenj commented 3 years ago

@tsurdilo

These are two completely different concepts: one is event uniqueness, and the other is business-logic uniqueness. If my use case is to have a workflow that acts on each individual event, then what you propose is completely useless.

If you think that 'id' isn't useful then perhaps you should file a PR against the cloud-events project to remove that field from their spec.

tsurdilo commented 3 years ago

I am confused: earlier, in the hamburgers example, event uniqueness was part of your business logic, but now it's not? Since the workflow is the orchestrator of the business logic (to order hamburgers), I would assume that the uniqueness requirement falls on the workflow, as it describes the business logic, and not on the downstream service. If you define in your workflow that the event consumer should give you only unique events of type X, then I think that solves your hamburger problem without the downstream service having to deal with defining how you as a company define "unique".

tsurdilo commented 3 years ago

Regarding "If you think that 'id' isn't useful then perhaps you should file a PR against the cloud-events project to remove that field from their spec": hehe, nice :) I don't think it's useless, just that it should not be used for business logic. You might say id + source defines "unique"; another project might say just "id" is enough to consider the event unique; some might factor in the type as well. The point is that, from the business-logic perspective, uniqueness should be dealt with at the runtime level (consuming events and pushing them to workflows) and not in your business logic, imo.

tsurdilo commented 3 years ago

I guess if we define which parts of the event should be merged into the state data, then this is all fine to do. The discussion about uniqueness is, I assume, domain-specific in how different users see it.

jorgenj commented 3 years ago

> I am confused: earlier, in the hamburgers example, event uniqueness was part of your business logic, but now it's not?

My exact words:

> they want to ensure that for every unique event one and only one hamburger is created in the downstream service

To me, that means I can leverage the uniqueness provided by the cloud-event 'id', and I don't have to solve an already-solved problem by adding some unique identifier to the payload.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.