re-design events resource and mechanisms for registering new events

cjdcordeiro commented 3 years ago

the events resource has a very simple schema and it is being underused.

Starting by the NuvlaBox, all relating actions (i.e. activate, commission, decommission, + custom actions , etc.) should generate an event, to be tracked in the NB detail page

0xbase12 commented 3 years ago

Current Event resource schema:

::timestamp              date-time?
::category               #{"state" "alarm" "action" "system"}
::severity               #{"critical" "high" "medium" "low"})
::content/resource/href  id?
::content/state          string?

~Changes:~ ~- triggered-by: user? (user causing the event creation)~ ~- parent: id? instead of content/resource/href~ ~- state string opt:string? ~ ~- new-state: opt:string? (only for category "state")~

~Events specification~ ~- Immutable (once created never change, edit not allowed~ ~- Event acl: Event are owned by group/nuvla-admin. Views rights are extracted from the concerned resource.~

Creation conditions:

Creation of Events are implemented server side.

Action category
- NuvlaBox
- commission
- decommission
- add-ssh-key
- revoke-ssh-key
- update
- reboot
- delete
- Deployment
- start
- stop
- update
- delete
State category
- Nuvlabox
- Deployment
Edition of a group
- user added
- user removed

Condition for deleting events:

Events are kept for 1 year

0xbase12 commented 1 year ago

Suggestion reuse existing event resource in Nuvla

complete (events registration), make it more systematic and make it consistent
change event spec to something more reusable generic
check if there is a cleanup somewhere for event index (seems to be 1 year)
set acl in such way that the event is visible for the user/group
event should register the session-id / user-id / active-claim
enabled by default for all resources, could be disabled at resource level or by admin (maybe global, maybe configurable by admin in configuration resource)

When we should register an event:

User call an action on a resource
Resource state/status(sometimes) change
Session creation
Session delete
Edit ?
Delete resource
Bulk delete ?
Bulk operations?
what else?

To be completed

alebellu commented 1 year ago

Type of events

From my current understanding, we would like to have events to represent both:

when something happens in the api-server (CRUD operations on resources, actions triggered, async actions complete)
when async jobs or other external (to the api-server) actions complete, with success or failure. See for example https://github.com/SixSq/tasklist/issues/1918

Below I analyze only the first kind of events (internal to the api-server). It should be the responsibility of the external components (the job engine, etc.) to signal interesting events (but the api-server can provide a means of doing it, for example by allowing creation of events via http calls).

As-is analysis of event processing in api-server

Severities

One of: ["critical" "high" "medium" "low"]

Event creation

Current way to add an event is by calling sixsq.nuvla.server.resources.event.utils/create-event.

Example of event created:

{:resource-type "event",
 :content {:resource {:href "deployment/847a79a2-993a-443a-bc4e-506d7424e1e9"},
           :state "deployment/847a79a2-993a-443a-bc4e-506d7424e1e9 created"},
 :severity "medium",
 :category "action",
 :timestamp "2023-07-24T13:37:55.464Z",
 :acl {:owners ["user/jane"]}}

which is created from this creation payload:

{:params {:resource-name "event"},
 :body {:resource-type "event",
        :content {:resource {:href "deployment/847a79a2-993a-443a-bc4e-506d7424e1e9"},
                  :state "deployment/847a79a2-993a-443a-bc4e-506d7424e1e9 created"},
        :severity "medium",
        :category "action",
        :timestamp "2023-07-24T13:37:55.464Z",
        :acl {:owners ["user/jane"]}},
 :nuvla/authn {:user-id "internal",
               :active-claim "group/nuvla-admin",
               :claims #{"group/nuvla-admin" "group/nuvla-anon" "group/nuvla-user"}}}

`sixsq.nuvla.server.resources.event.utils/create-event` is currently called in:

- `sixsq.nuvla.server.resources.deployment/create-deployment` - Call: `(event-utils/create-event deployment-id (get-in create-response [:body :message]) (a/default-acl (auth/current-authentication request)))` In turn called in: - method `crud/add` - method `crud/do-action [resource-type "clone"]` - `sixsq.nuvla.server.resources.deployment-parameter` method `crud/add` - Call: `(event-utils/create-event (:parent resource) value acl :severity "medium" :category "state")` - `sixsq.nuvla.server.resources.deployment-set/create-job` - Call: `(event-utils/create-event (:id resource) action (a/default-acl (auth/current-authentication request)))` In turn called in: - method `crud/add` - method `crud/do-action [resource-type utils/action-create]` - method `crud/do-action [resource-type utils/action-start]` - `sixsq.nuvla.server.resources.deployment-set/action-bulk` - Call: `(event-utils/create-event (str resource-type "/" uuid) action (a/default-acl (auth/current-authentication request)))` In turn called in: - method `crud/do-action [resource-type utils/action-start]` - method `crud/do-action [resource-type utils/action-stop]` - `sixsq.nuvla.server.resources.infrastructure-service/post-delete-hooks` - Call: `(event-utils/create-event (str resource-type "/" uuid) "DELETED" (a/default-acl (auth/current-authentication request)) :severity "low" :category "state")` - `sixsq.nuvla.server.resources.infrastructure-service/event-state-change` - Call: `(event-utils/create-event id new-state (a/default-acl (auth/current-authentication request)) :severity "low" :category "state")` In turn called in: - `sixsq.nuvla.server.resources.infrastructure-service` method `crud/edit` - `sixsq.nuvla.server.resources.infrastructure-service-coe/create-job` - In turn called by `crud/do-action` - `sixsq.nuvla.server.resources.infrastructure-service-coe` method `sixsq.nuvla.server.resources.infrastructure-service/post-add-hook` - Call: `(event-utils/create-event id "STARTING" (a/default-acl (auth/current-authentication request)) :severity "low" :category "state")` - `sixsq.nuvla.server.resources.infrastructure-service-coe/create-job` - Call: `(event-utils/create-event (:id resource) (format "created job %s with id %s" job-name job-id) (a/default-acl (auth/current-authentication request)))` - `sixsq.nuvla.server.resources.infrastructure-service-generic` method `infra-service/post-add-hook` - Call: `(event-utils/create-event id (:state service) (a/default-acl (auth/current-authentication request)) :severity "low" :category category)` - `sixsq.nuvla.server.resources.nuvlabox` method `decommission-async` - Call: `(event-utils/create-event id (str "decommissioning " id " with async " job-id) (:acl request))` In turn called in: - method `crud/do-action [resource-type "decommission"]` - `sixsq.nuvla.server.resources.nuvlabox/check-api` - Call: ``(event-utils/create-event (:id resource) (str "checking the API for NuvlaBox " id " with async " job-id) (:acl resource))`` In turn called in: - method `crud/do-action [resource-type "check-api"]` - `sixsq.nuvla.server.resources.nuvlabox/reboot` - Call: `(event-utils/create-event (:id resource) (str "sending reboot request to NuvlaBox " id " with async " job-id) (:acl resource))` In turn called in: - method `crud/do-action [resource-type "reboot"]` - `sixsq.nuvla.server.resources.nuvlabox/cluster-nuvlabox` - Call: `(event-utils/create-event (:id resource) (str "running cluster action " cluster-action " on NuvlaBox " (:id resource) ", with async " job-id) (:acl resource))` In turn called in: - method `crud/do-action [resource-type "cluster-nuvlabox"]` - `sixsq.nuvla.server.resources.nuvlabox/add-ssh-key` - Call: `(event-utils/create-event (:id resource) (str "asking NuvlaBox " (:id resource) " to add SSH key " cred-id " with async " job-id) (:acl resource))` In turn called in: - method `crud/do-action [resource-type "add-ssh-key"]` - `sixsq.nuvla.server.resources.nuvlabox/revoke-ssh-key` - Call: `(event-utils/create-event (:id resource) (str "removing SSH key " (-> request :body :credential) " from NuvlaBox " id " with async " job-id) (:acl resource))` In turn called in: - method `crud/do-action [resource-type "revoke-ssh-key"]` - `sixsq.nuvla.server.resources.nuvlabox/update-nuvlabox` - Call: `(event-utils/create-event (:id resource) (str "updating NuvlaBox " id " with target release " nb-release-id", with async " job-id) acl)` In turn called in: - method `crud/do-action [resource-type "update-nuvlabox"]`

Actions

Sometimes actions on a resource trigger crud operations on other resources, for example deployment-set create triggers crud/add on a job resource. Other times an action does not trigger any crud operation, e.g deployment-set plan. And in other cases an action can have impact on the resource it is being called upon, and on other resources as well; e.g deplyment-set start edits the deployment-set resource and can create a new job as well.

Events retention

Events are retained in ES for 1 year, then they are deleted.

Summary

Events are currently created via a call to event-utils/create-event:
- The resource-href passed to create-event is the resource id most of the time, but not always (e.g for deployment-parameter it is the parent id)
- The state (second parameter to create-event) is currently an action map or a message
- acl is sometimes that of the resource, other times is the default acl for the current user
event-utils/create-event is in turn called by either:
- CRUD actions (crud/add, crud/edit, crud/delete)
- crud/do-action
The outcome of an operation is often represented directly in the http response object, so it probably makes sense for an event processing generic utility to take the http request and response as inputs;
Events are retained for 1 year.

Goals

Generate an audit trail of significant events, regarding crud operations, sync and async actions, bulk actions;
Provide search functionality to list events related to specific resources.

Non-goals

To keep a fully re-playble history of events as in event sourcing.

Even though this is not a goal at the moment, I think it would be a good a idea to provide enough information in the events that we decide to track to make them re-playable, unless it complicates the code too much. It would be less space efficient than only keeping track of essential attributes, but it would give full traceability, and would also keep all options open regarding the possible uses that we might want to make of registered events.

Proposal

Event spec

change the spec of the event resource type:
- add a mandatory event-type field, to uniquely identity the type of event
- for CRUD operations: resource-created, resource-updated, resource-deleted
- for actions: <resource-type>/<action name>
- for bulk actions: <resource-type>/bulk-<action name>
This field can be useful for filtering events. Maybe move it under content ? Clarify how content is used in general.
- add session-id, user-id and active-claim. Reuse utility fns in sixsq.nuvla.auth.utils to fill them in:
  - session-id <- (auth/auth/current-session-id request)
  - user-id <- (auth/current-user-id request)
  - active-claim <- (auth/current-active-claim request)
Maybe move it under content ? Clarify how content is used in general.
- add an optional message field
Maybe move it under content ? Clarify how content is used in general.
- content with required keys
- resource.href: the href of the specific resource involved
- resource.type: the type of resource involved
- payload: field to store arbitrary data for the event. This field would replace the existing state field (a state change might be represented as part of the payload). The format will depend on the event-type. For example:
  - resource-created: store the request payload, but also the full resource created, including the generated attributes
  - resource-updated: store the request payload, and the fields that end up being updated Exactly what to store in the payload field might be customizable via the resource event registry when it makes sense to do so.
- parent: there is already an optional parent property in the common attributes of all resources. I would use it for events to point to the parent event when relevant (see below).
- correlation-id: Maybe add this later, if we see the need. It is a concept defined in the event sourcing/CQRS domain, it might be useful in case of long chain of events, to easily track an entire conversation. See for example:
- https://blog.arkency.com/correlation-id-and-causation-id-in-evented-systems/
- https://github.com/thenativeweb/commands-events/issues/1
Basically every event is suggested to have 3 ids:
- ID: Identifier of a single command / event.
- Causation ID: Pointer to the predecessor of a command / event.
- Correlation ID: Pointer to the first command / event in a chain to identify a long-running transaction.
In these terms, the parent property above would represent the Causation ID.

[UPDATE 02/08/2023] In the first iteration I tried to get rid of the :content top-level key, to see what happens.

Resulting top-level spec would be something like:

(s/def ::resource
  (-> (st/spec (su/only-keys :req-un [::href]
                             :opt-un [::common/resource-type]))))

  (s/def ::schema
    (su/only-keys-maps common/common-attrs
                   {:req-un [::timestamp
                             ::event-type
                             ::message
                             ::content  {:req-un [::resourcce {:req-un [::type, ::href]}
                                                  ::payload]}
                             ::category
                             ::severity
                             ::user-id
                             ::active-claim]
                    :opt-un [::session-id]}))

Acl

Set acl in such a way that the event is visible for the user/group that created it.

Something like this ?

{:owners      ["group/nuvla-admin"]
 :view-meta [(auth/current-active-claim request)]}

Is this enough? Not sure yet..

Event config registry

Have a global registry of events configurations, keyed by resource type.

add a global event-config-registry atom, as a map of this form:
- key is a resource type
- value is in turn a map, of this form:
- key is <crud operation id> or <action id>
- value is a map with the following keys:
  - enabled: true if the crud operation or action should trigger events for the given resource type. Default true.
  - event-type: the default event type For example:
  - resource-created for crud/add
  - resource-updated for crud/edit
  - resource-deleted for crud/delete
  - <resource-type>/<action name> for actions
  - default-category: default category for the given resource type (optional)
  - default-severity: default severity for the given resource type (optional) (+ some additional keys, to be decided, to filter only certain response status codes (when a response is passed in), and to specify what should end up in the payload field.

Event creation utility fn

An event creation utility fn should be added; use sixsq.nuvla.server.resources.event.utils/create-event as starting point. The utility fn should:

check if eventing is enabled for the given resource and operation, by checking the registry described above;
create the event using the defaults and the directions given in the registry and store it in ES;
back-in a retry mechanism (see section on consistency below).

CRUD operations

Make the generation of events more systematic, by plugging-in the event generation as part of the crud operations. Have sane defaults, but give possibility to override by resource type (and maybe by event type?). Modify the standard crud implementation sixsq.nuvla.server.resources.common.std-crud for add, edit and delete operations to also call the event creation utility fn.

Actions

It would be good to make eventing more systematic also for actions other then crud.

Proposal is to:

create an event immediately at the beginning of the implementation of each action, to keep a trace of the domain request. The action impl should call the event creation utility fn above explicitly, or a facade to it tailored for actions. Or maybe we can write an action impl wrapper that calls the event creation utility fn.
when other resources are involved, trigger other events at the sub-resources level, and set the above event as parent event (in the initial prototype I added a simple macro with-parent-event to make this easier).

Bulk actions

Single or multiple (one per resource) events or both for bulk actions ?

single bulk action event

pros:
- the domain intent is preserved
cons:
- a search for events on a specific resource that is impacted by the bulk action would not return an event
multiple events, one per resource impacted by the bulk action

pros:
- history of events on a specific impacted resource would have a trace of the action
cons:
- no trace of the bulk action
both, a bulk action level event and one event per resource It would be good to keep a link, like a parent field on the resource level events pointing to the bulk action event

pros:
- domain intent represented with the bulk action event, and also history of each resource clearly represented
cons:
- more verbose, less space efficient

I would go for the third option of registering both a top level event and one event per resource linked to the top level one, for maximum traceability. For bulk actions impacting hundreds or thousands of resources this would be inefficient: if this happens to be the case we should make the policy configurable via the event config registry.

Async actions

When an async action is triggered via the api-server, an event can be generated to signal that the processing has started, but it should be made clear that the action is not finished. And it should be the responsibility of the action runner/job engine/external system to create an event once the action is completed. Those events could be linked though, via the parent and/or correlation-id properties.

Direct event creation

Regardless of how generic we make the above mechanism, there will be cases where it does not fit. In these cases it might make sense to create events directly, by-passing the automatic creation and the resource event registry. There should not be too many such cases though, otherwise the utility of the generic method above is in question.

Events searching and filtering

Given the above, it should be possible for a user with the proper permissions to:

list all the events linked to a specific resource, by filtering by resource.href
list all the events linked to a specific resource type, by filtering by resource.type
list all the events linked to a specific user, session or claim
list all the events of a given type
fetch child events; for example fetching the children of a deployment start event would give the corresponding job creation and job completion events.

Notifications

TODO: clarify how notifications work in Nuvla. As far as I can tell, there is currently no link between notifications and events. It might make sense to have an optional event-href property pointing to the an event to which a notification might refer.

Remove all present calls to `create-event`

Configure the registry in such a way that all current events are still logged, with all the information that is currently logged.

Immutability

Events should be immutable, edit and delete should not be allowed (at least for standard user roles).

Events retention

Events are retained in ES for 1 year, then they are deleted. The current retention period seems reasonable, but it would be good to re-evaluate once many more events are being generated, based on the load on ES.

Worth also considering: instead of deleting old events, we might want to zip them up and send them to S3 or similar cheap service for cold storage, and have a much longer retention period in that form.

Migration of existing events

how to migrate ?
impacts on ui or other components ? who is making use of events?
is any customer making use of events already?

Note on consistency

With the proposal above, we can think of a couple of undesirable scenarios:

an event is committed but the corresponding action fails
an action succeeds, but the corresponding event fails to commit

Ideally events and the actions they represent should be committed in an atomic transaction. Elastic Search, which is where resources are currently being stored, does not support atomic transactions. Also, some actions do not interact with Elastic Search, so making them atomic with the event creation would be challenging anyway.

Possible approaches to the above issue coming to mind:

Perform the action first, and only if the action succeeds, then try to commit the event. If commit of the event fails then try a few more times with exponential backoff; if that also fails then log a SEVERE error in the logs.

Pros:
- simple to implement
- low impact on existing logic
Cons:
- does not insulate from a loss of power event, or ES failure, or network failure, occurring between the end of the action and the storing of the event: in that case we could have an action being successfully performed without a corresponding event logged anywhere in the system.
- requires an active monitoring of the logs to be effective (SEVERE errors should be notified to admins)
Commit the event first, and only if that succeeds then perform the action. If there is a failure while performing the action, we might think of deleting the event: but the event might have been consumed by other parties in the meantime, so that should not be considered as an option. An alternative could be to emit a compensating event, basically erasing the validity of the previous event.

Pros:
- simple to implement
Cons:
- there is still much potential for inconsistencies in case of unrecoverable failures in between any of the 3 phases (event commit - action - compensating event commit)
- handling with compensating events might be very complex, and this complexity is shifted here to downstream event consumers, which arguably should not be bothered with this kind of issues
Commit the event first and perform the action asynchronously, with a retry mechanism. Most interactions with the system are made via synchronous http calls, which are expected to return only once the action is completed. As such, this approach would require changes to CRUD operations and other actions to wait for the asynchronous result to be ready, before responding to the client.

Pros:
- does insulate from a loss of power event, even if it occurs between the storing of the event and the end of the action
Cons:
- consistency is "eventual": there is a period of time in which there is an event in the system saying that an action has been performed, while that has not happened yet.
- implementation complexity: this approach would be in the direction of building an event driven architecture, designed for asynchronous processing and eventual consistency. Which can have some advantages but at the cost of much increased complexity, because every component of the system should be designed taking the eventual consistency principle into account.
Other ?

I have not considered the option of issuing compensating actions in case of failures in storing events, because it seems unreasonable for example to start an app deployment only to cancel it a few ms later, only because it was not possible to log the corresponding event.

Of the options above, option 1. seems to me like the best compromise for the Nuvla system at the moment. Maybe a solution like 3., or a similar form of event driven architecture, could be considered in the future?

0xbase12 commented 1 year ago

To keep a fully re-playble history of events as in event sourcing.

This is an interesting proposition to be able to trace events. But I would suggest to add it in a second phase.

0xbase12 commented 1 year ago

Event config registry

I advise to make it a defmulti to get the config event for each resource. Because we load some resources dynamically from other projects jar. If the config of the resource is defined in each resource namespace, it will be loaded also dynamically at runtime. I think it’s better than having to configure resources that will maybe be loaded later. The other solution is to have function that each resource call at initialization that add event config to the atom registry. But I personally I prefer the defmulti.

0xbase12 commented 1 year ago

Maybe move it under content ? Clarify how content is used in general.

what is the advantage of nesting instead keeping a flat document

0xbase12 commented 1 year ago

Set acl in such a way that the event is visible for the user/group that created it.

User need at least view-data rights to be able to see the content of the event. (view-meta right give access only to name description common attributes) I'm thinking if it's better to give view-data on event to all users that are able to see the content of the resource that is triggering the event. (e.g. nuvlabox resource user/a action call on it should be also visible to all users that can see this nuvlabox

0xbase12 commented 1 year ago

Migration of existing events

• how to migrate ? We can create a migration script from old schema to the new one. But we can do it at the very end. Do not bother with that at this step. • impacts on ui or other components ? who is making use of events? UI / notification system / api-server-private • is any customer making use of events already? Customers can see events in UI. I think that’s all usage that is being done from events by customers.

0xbase12 commented 1 year ago

Note on consistency. Possible approaches to the above issue coming to mind:

Other suggestion • Commit the event first, if event creation fail log it but do not stop the original request. • If there is a failure while performing the action, we don’t delete the event, we create a new event that state action failure.

0xbase12 commented 1 year ago

payload: field to store arbitrary data for the event.

What is the spec definition for this field? Because ES doesn't play well with open maps.

alebellu commented 1 year ago

Event config registry

I advise to make it a defmulti to get the config event for each resource. Because we load some resources dynamically from other projects jar. If the config of the resource is defined in each resource namespace, it will be loaded also dynamically at runtime. I think it’s better than having to configure resources that will maybe be loaded later. The other solution is to have function that each resource call at initialization that add event config to the atom registry. But I personally I prefer the defmulti.

Makes sense. Will change it to defmulti, thanks.

alebellu commented 1 year ago

• impacts on ui or other components ? who is making use of events? UI / notification system / api-server-private

Ok, then given all these impacted systems, I guess it is better if the event spec is backward-compatible ? At least until all other components have migrated.

alebellu commented 1 year ago

payload: field to store arbitrary data for the event.

What is the spec definition for this field? Because ES doesn't play well with open maps.

I took the spec of ::callback/data as blueprint:

(s/def ::payload
  (-> (st/spec (su/constrained-map keyword? any?))
      (assoc :name "payload"
             :json-schema/type "map"
             :json-schema/description "event payload"
             :json-schema/indexed false)))

The idea would be that this field is a dump of whatever might be useful, but should not be indexed by ES. If we need additional info to be indexed they should be added to the spec. For example in the prototype I moved the state property inside payload: the idea would be that you search by event-type instead. (but I will put it back for backward compatibility, see comment above).

I didn't see issues with it in my limited tests,do you foresee some problems with the spec above ?

0xbase12 commented 1 year ago

payload: field to store arbitrary data for the event.

What is the spec definition for this field? Because ES doesn't play well with open maps.

I took the spec of ::callback/data as blueprint:
(s/def ::payload
  (-> (st/spec (su/constrained-map keyword? any?))
      (assoc :name "payload"
             :json-schema/type "map"
             :json-schema/description "event payload"
             :json-schema/indexed false)))
The idea would be that this field is a dump of whatever might be useful, but should not be indexed by ES. If we need additional info to be indexed they should be added to the spec. For example in the prototype I moved the state property inside payload: the idea would be that you search by event-type instead. (but I will put it back for backward compatibility, see comment above).

I didn't see issues with it in my limited tests,do you foresee some problems with the spec above ?

Interesting suggestion. We will have to check that ES doesn't count the keys in case we set the field as not indexed.

alebellu commented 1 year ago

Maybe move it under content ? Clarify how content is used in general.

what is the advantage of nesting instead keeping a flat document

Ok, for some reason I thought that the keyword :content was treated in a special way that I was not grasping :-) I would have just 2 properties under content, resource and payload, so I don't see much value in grouping them under :content. On the other hand if we need to add many additional fields (for ES indexing) then I can see the value of doing that.

alebellu commented 1 year ago

ES doesn't count the keys

What do you mean ? Sorry,my understanding of how ES works is pretty basic..

alebellu commented 1 year ago

Note on consistency. Possible approaches to the above issue coming to mind:

Other suggestion • Commit the event first, if event creation fail log it but do not stop the original request. • If there is a failure while performing the action, we don’t delete the event, we create a new event that state action failure.

Ok, in this scenario I would also create an event when the action succeeds. Otherwise if you are listening to events live, you have no way of distinguishing if an action started but is still in progress vs an action started and succeeded.

0xbase12 commented 1 year ago

payload: field to store arbitrary data for the event.

What is the spec definition for this field? Because ES doesn't play well with open maps.

I took the spec of ::callback/data as blueprint:
(s/def ::payload
  (-> (st/spec (su/constrained-map keyword? any?))
      (assoc :name "payload"
             :json-schema/type "map"
             :json-schema/description "event payload"
             :json-schema/indexed false)))
The idea would be that this field is a dump of whatever might be useful, but should not be indexed by ES. If we need additional info to be indexed they should be added to the spec. For example in the prototype I moved the state property inside payload: the idea would be that you search by event-type instead. (but I will put it back for backward compatibility, see comment above). I didn't see issues with it in my limited tests,do you foresee some problems with the spec above ?
Interesting suggestion. We will have to check that ES doesn't count the keys in case we set the field as not indexed.

I tested that we can fit a big number of keys in payload and seems to be working when not indexed in ES. 👏

0xbase12 commented 1 year ago

ES doesn't count the keys

What do you mean ? Sorry,my understanding of how ES works is pretty basic..

I mean that ES limit indexed fields to 1000 keys per index by default.

Here is an example of an error if we index the attribute payload and we register more that 1000 keys over the documents: "Limit of total fields [1000] in index [nuvla-event] has been exceeded"

But this is no more relevant, since when the payload attribute is not indexed ES doesn't account payload nested keys as part of that limit.

alebellu commented 1 year ago

ES doesn't count the keys

What do you mean ? Sorry,my understanding of how ES works is pretty basic..

I mean that ES limit indexed fields to 1000 keys per index by default.

Here is an example of an error if we index the attribute payload and we register more that 1000 keys over the documents: "Limit of total fields [1000] in index [nuvla-event] has been exceeded"

But this is no more relevant, since when the payload attribute is not indexed ES doesn't account payload nested keys as part of that limit.

Interesting, thanks.

alebellu commented 1 year ago

[Update 11/08/2023]

After several code iterations I settled on some design decisions that I try to detail below.

For the implementation and latest changes, see the linked PR.

Event types

All events must have an event type:

An event type is a string identifier, and implicitly a specification for a set of event objects that have the same semantic intent and same structure.

Event types must belong to a category, which is a way of grouping sets of event types with similar or related semantic. They can also have a subcategory.

Several public implementation of event systems use event types.

Sometimes it is called event name, sometimes it is called event type. ##### Stripe See https://stripe.com/docs/api/events/types

Examples

| Event type | Object | Description | |--------------------------------|---------------------------------|-----------------------------------------------------------------------------------------| | account.application.authorized | data.object is an "application" | Occurs whenever a user authorizes an application. Sent to the related application only. | | charge.expired | data.object is a charge | Occurs whenever an uncaptured charge expires. | | charge.failed | data.object is a charge | Occurs whenever a failed charge attempt occurs. | | charge.pending | data.object is a charge | Occurs whenever a pending charge is created. | | charge.refund.updated | data.object is a refund | Occurs whenever a refund is updated, on selected payment methods. | | charge.refunded | data.object is a charge | Occurs whenever a charge is refunded, including partial refunds. |

##### AWS See https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html

Examples

```json { "version": "0", "id": "cd4d811e-ab12-322b-8255-872ce65b1bc8", "detail-type": "event type", "source": "aws.config", "account": "111122223333", "time": "2018-03-22T00:38:11Z", "region": "us-east-1", "resources": [ resources ], "detail": { specific message type } } ``` ```json { "version":"0", "id":"2f8147ab-8c48-47c6-b0b6-3ee23ec8d300", "detail-type":"EMR Auto Scaling Policy State Change", "source":"aws.emr", "account":"123456789012", "time":"2016-12-16T20:42:44Z", "region":"us-east-1", "resources":[], "detail":{ "resourceId":"ig-X2LBMHTGPCBU", "clusterId":"j-1YONHTCP3YZKC", "state":"PENDING", "message":"AutoScaling policy modified by user request", "scalingResourceType":"INSTANCE_GROUP" } } ``` ```json { "version": "0", "id": "ffd8a6fe-32f8-ef66-c85c-111111111111", "detail-type": "Tag Change on Resource", "source": "aws.tag", "account": "123456789012", "time": "2018-09-18T20:41:06Z", "region": "us-east-1", "resources": [ "arn:aws:ec2:us-east-1:123456789012:instance/i-0000000aaaaaaaaaa" ], "detail": { "changed-tag-keys": [ "key2", "key3" ], "service": "ec2", "resource-type": "instance", "version": 5, "tags": { "key4": "value4", "key1": "value1", "key2": "value2" } } } ```

##### GCP See for example: https://cloud.google.com/eventarc/docs/event-types

Examples

- `google.cloud.apigateway.api.v1.created` - `google.cloud.apigateway.api.v1.deleted` - `google.cloud.apigateway.api.v1.updated` - `google.cloud.gkebackup.backup.v1.created` - `google.cloud.gkebackup.backup.v1.deleted` - `google.cloud.gkebackup.backup.v1.updated` - `google.cloud.gkebackup.restore.v1.created` - `google.cloud.gkebackup.restore.v1.deleted` - `google.cloud.gkebackup.restore.v1.updated` - `google.cloud.storage.object.v1.archived` - `google.cloud.storage.object.v1.deleted` - `google.cloud.storage.object.v1.finalized` - `google.cloud.storage.object.v1.metadataUpdated` - `google.cloud.dataflow.job.v1beta3.statusChanged`

In api-server, for every crud operation or action we define the following event types:

one event type to represent the fact that a request is received for the given crud operation or action
one event type to represent the fact that the given crud operation or action has completed its execution without errors
one event type to represent the fact that the given crud operation or action has failed

For example, the successful update of a nuvlabox resource generates 2 events:

nuvlabox.update
nuvlabox.update.completed

If the operation above fails for some reason, it will generate the following events instead:

nuvlabox.update
nuvlabox.update.failed

Event details

Based on the event type (or its category), an event can contain additional details, which are grouped under the key details.

For example, events of category state must have a key new-state nested under details, and optionally a key old-state.

Event parent

In the example above the 2 events generated from the creation of a nuvlabox resource have no apparent link, other than the resource to which they refer to.

To make the link between the events explicit we use the property parent.

In the examples above the events nuvlabox.update.completed and nuvlabox.update.failed would have the event nuvlabox.update as their parent.

Some actions trigger other actions or operations internally, which in turn generate other events: this situation gives rise to a tree of events, where the root of the tree is the event representing the outermost request.

ACL

The acl on events is set with the following algorithm:

if the event references a specific resource
- if the resource exists
- if the user has ::view-data access to the resource
  - set the event acl equal to that of the resource
- else
  - if the event type has the allow-anon flag
  - set the event owner equal to the active claim or to group/nuvla-admin if active claim is not present
  - ~~else throw an authorization exception~~: making event creation fail due to permission issues makes the system too prone to errors, as I experienced checking events on lifecycle tests. Instead we want the event system to be stable and reliable. So, changing this to =>
  - else still allow event creation, but set the owner to group/nuvla-admin. (if we keep this approach, we can later remove the allow-anon flag, which is made redundant).
- else (if the resource does not exist)
- allow the creation of the event, but set the event owner to the active claim or to group/nuvla-admin if active claim is present
else (if the event does not reference a specific resource)
- set the event owner to the active claim or to group/nuvla-admin if active claim is not present

Implementation details

Event manager

CRUD operations on events are redirected to an event manager.

The event manager has a generic interface and the actual implementation can be switched (I used db bindings and payments interface as blueprints for this).

This makes it possible to easily switch from an implementation to another. For now there is only one main implementation (DbEventManager) that stores the events in Elastic Search and sends them to Kafka at the same time. There is also a very simple implementation (NoEventsEventManager) that does not store events.

The event manager also offers some wrapper functions for crud operations. The idea is that crud operations can be triggered from code with events enabled or without events enabled, depending on the situation.

CRUD with events facade

New facade functions to invoke crud operations or actions with events generation enabled:

crud/add-with-events
crud/edit-with-events
crud/delete-with-events
crud/do-action-with-events

Per-resource type enabling/disabling

Events can also be enabled or disabled per resource type by redefining the multimethod sixsq.nuvla.events.config/events-enabled?

Also add per-event-type and per-category enabling/disabling of events? It can probably be useful, I would just wait for the first use-case before implementing it.

Other multi methods

sixsq.nuvla.events.config/supported-event-types: Return the event types supported by the given resource type, along with their category, and possibly subcategory and other flags (for now only allow-anon).
sixsq.nuvla.events.config/details-spec: Returns the spec for the details field for the given event type.
sixsq.nuvla.events.config/category-default-details-spec: Returns the default spec for the details field for event types of the given category.

Failure mode

After further thinking, I am still a bit reluctant about letting an operation proceed when there is an error storing an event. If later on we want to build other functionalities on top of the event system, we must be able to rely on the fact that every operation/action is accompanied by its corresponding events in the system.

I would do the following:

if there is a failure storing the initial event which traces the operation request, then we make the whole operation fail. In this case the client is expected to repeat the call.
if there is a failure in storing any subsequent events, then we log the failure but let the operation continue.

Such that:
every client call will be traced in the event logs, at least with the initial event
we can investigate issues out of band by looking for suspect events witch have an opening but not a closing (with either failure or success).

In any case, it is very important for the event creation code to be made simple and free of logical errors as much as possible: to be clear, it should only fail if the events are malformed, or if ES or the network etc. fails.

Event testing

Since operations and actions on any resource will generate events by default, most of the lifecycle tests can be integrated with checks on the generated events.

I started doing this work for the module, the deployment and the nuvlabox0 resources, but it is a long way from being complete. I added a macro sixsq.nuvla.server.resources.lifecycle-test-utils/are-events to make the testing of events easier: it checks the event type, the resource, and the hierarchy of the events.

Plan for next week

failure mode as explained above
add support for bulk events

Plan for later

root property: currently we are filling in the parent property which has the same meaning of a causation id in event sourcing. It would be good to also add a root property, so that all events in a conversation can be retrieved in a single query (root, children, grand-children, etc. will all have the same value on the root property). Such a root property would perform the same function as a correlation id in event sourcing.

I think it can be an optional property in the common properties of all resources: since parent is defined there, the notion of root should also make sense in general.
continue adding checks for events in resource lifecycle tests : this activity can take much time if we do it for most resources
list of categories is currently static in the event spec: make it dynamic, based on registered event types
in tests change are-events macro to keep a pointer to the last consumed event, and consume all remaining events when called, instead of using the count from the expected
retry mechanism for event storing ?

0xbase12 commented 1 year ago

Some enhancement ideas:

Group session-id, user-id, active-claim in a single nested attribute equivalent to authn-info of the request.

konstan commented 11 months ago

Reason for Pending: work on Projects is impeding the progress.

nuvla / api-server