Open cjdcordeiro opened 3 years ago
Current Event resource schema:
::timestamp date-time?
::category #{"state" "alarm" "action" "system"}
::severity #{"critical" "high" "medium" "low"})
::content/resource/href id?
::content/state string?
~Changes:~ ~- triggered-by: user? (user causing the event creation)~ ~- parent: id? instead of content/resource/href~ ~- state string opt:string? ~ ~- new-state: opt:string? (only for category "state")~
~Events specification~ ~- Immutable (once created never change, edit not allowed~ ~- Event acl: Event are owned by group/nuvla-admin. Views rights are extracted from the concerned resource.~
Creation conditions:
Creation of Events are implemented server side.
Condition for deleting events:
Suggestion reuse existing event resource in Nuvla
When we should register an event:
To be completed
From my current understanding, we would like to have events to represent both:
Below I analyze only the first kind of events (internal to the api-server). It should be the responsibility of the external components (the job engine, etc.) to signal interesting events (but the api-server can provide a means of doing it, for example by allowing creation of events via http calls).
One of: ["state" "alarm" "action" "system" "user" "email"]
One of: ["critical" "high" "medium" "low"]
Current way to add an event is by calling sixsq.nuvla.server.resources.event.utils/create-event
.
Example of event created:
{:resource-type "event",
:content {:resource {:href "deployment/847a79a2-993a-443a-bc4e-506d7424e1e9"},
:state "deployment/847a79a2-993a-443a-bc4e-506d7424e1e9 created"},
:severity "medium",
:category "action",
:timestamp "2023-07-24T13:37:55.464Z",
:acl {:owners ["user/jane"]}}
which is created from this creation payload:
{:params {:resource-name "event"},
:body {:resource-type "event",
:content {:resource {:href "deployment/847a79a2-993a-443a-bc4e-506d7424e1e9"},
:state "deployment/847a79a2-993a-443a-bc4e-506d7424e1e9 created"},
:severity "medium",
:category "action",
:timestamp "2023-07-24T13:37:55.464Z",
:acl {:owners ["user/jane"]}},
:nuvla/authn {:user-id "internal",
:active-claim "group/nuvla-admin",
:claims #{"group/nuvla-admin" "group/nuvla-anon" "group/nuvla-user"}}}
Sometimes actions on a resource trigger crud operations on other resources,
for example deployment-set create
triggers crud/add
on a job
resource.
Other times an action does not trigger any crud operation, e.g deployment-set plan
.
And in other cases an action can have impact on the resource it is being called upon, and on other resources
as well; e.g deplyment-set start
edits the deployment-set resource and can create a new job as well.
Events are retained in ES for 1 year, then they are deleted.
event-utils/create-event
:
resource-href
passed to create-event
is the resource id most of the time, but not always
(e.g for deployment-parameter
it is the parent id)state
(second parameter to create-event
) is currently an action map or a messageacl
is sometimes that of the resource, other times is the default acl for the current userevent-utils/create-event
is in turn called by either:
crud/add
, crud/edit
, crud/delete
)crud/do-action
Even though this is not a goal at the moment, I think it would be a good a idea to provide enough information in the events that we decide to track to make them re-playable, unless it complicates the code too much. It would be less space efficient than only keeping track of essential attributes, but it would give full traceability, and would also keep all options open regarding the possible uses that we might want to make of registered events.
change the spec of the event
resource type:
event-type
field, to uniquely identity the type of eventresource-created
, resource-updated
, resource-deleted
<resource-type>/<action name>
<resource-type>/bulk-<action name>
This field can be useful for filtering events.
Maybe move it under content
? Clarify how content
is used in general.
session-id
, user-id
and active-claim
.
Reuse utility fns in sixsq.nuvla.auth.utils
to fill them in:
session-id <- (auth/auth/current-session-id request)
user-id <- (auth/current-user-id request)
active-claim <- (auth/current-active-claim request)
Maybe move it under content
? Clarify how content
is used in general.
message
fieldMaybe move it under content
? Clarify how content
is used in general.
content
with required keysresource.href
: the href of the specific resource involvedresource.type
: the type of resource involvedpayload
: field to store arbitrary data for the event.
This field would replace the existing state
field (a state change might be represented as part of the payload).
The format will depend on the event-type. For example:
resource-created
: store the request payload, but also the full resource created, including the generated attributesresource-updated
: store the request payload, and the fields that end up being updated
Exactly what to store in the payload
field might be customizable via the resource event registry when it makes sense to do so.parent
: there is already an optional parent
property in the common attributes of all resources.
I would use it for events to point to the parent event when relevant (see below).correlation-id
: Maybe add this later, if we see the need.
It is a concept defined in the event sourcing/CQRS domain, it might be useful in case of long chain of events,
to easily track an entire conversation.
See for example:Basically every event is suggested to have 3 ids:
In these terms, the parent
property above would represent the Causation ID.
[UPDATE 02/08/2023] In the first iteration I tried to get rid of the :content
top-level key, to see what happens.
Resulting top-level spec would be something like:
(s/def ::resource
(-> (st/spec (su/only-keys :req-un [::href]
:opt-un [::common/resource-type]))))
(s/def ::schema
(su/only-keys-maps common/common-attrs
{:req-un [::timestamp
::event-type
::message
::content {:req-un [::resourcce {:req-un [::type, ::href]}
::payload]}
::category
::severity
::user-id
::active-claim]
:opt-un [::session-id]}))
Set acl in such a way that the event is visible for the user/group that created it.
Something like this ?
{:owners ["group/nuvla-admin"]
:view-meta [(auth/current-active-claim request)]}
Is this enough? Not sure yet..
Have a global registry of events configurations, keyed by resource type.
event-config-registry
atom, as a map of this form:
<crud operation id>
or <action id>
enabled
: true
if the crud operation or action should trigger events for the given resource type. Default true
.event-type
: the default event type
For example:resource-created
for crud/add
resource-updated
for crud/edit
resource-deleted
for crud/delete
<resource-type>/<action name>
for actionsdefault-category
: default category for the given resource type (optional)default-severity
: default severity for the given resource type (optional)
(+ some additional keys, to be decided, to filter only certain response status codes (when a response is passed in), and to specify what should end up in the payload
field.An event creation utility fn should be added; use sixsq.nuvla.server.resources.event.utils/create-event
as starting point.
The utility fn should:
Make the generation of events more systematic, by plugging-in the event generation as part of the crud operations.
Have sane defaults, but give possibility to override by resource type (and maybe by event type?).
Modify the standard crud implementation sixsq.nuvla.server.resources.common.std-crud
for add
, edit
and delete
operations to also call the event creation utility fn.
It would be good to make eventing more systematic also for actions other then crud.
Proposal is to:
with-parent-event
to make this easier).Single or multiple (one per resource) events or both for bulk actions ?
single bulk action event
pros:
cons:
multiple events, one per resource impacted by the bulk action
pros:
cons:
both, a bulk action level event and one event per resource
It would be good to keep a link, like a parent
field on the resource level events pointing to the bulk action event
pros:
cons:
I would go for the third option of registering both a top level event and one event per resource linked to the top level one, for maximum traceability. For bulk actions impacting hundreds or thousands of resources this would be inefficient: if this happens to be the case we should make the policy configurable via the event config registry.
When an async action is triggered via the api-server, an event can be generated to signal that the processing has started, but it should be made clear that the action is not finished. And it should be the responsibility of the action runner/job engine/external system to create an event once the action is completed. Those events could be linked though, via the parent and/or correlation-id properties.
Regardless of how generic we make the above mechanism, there will be cases where it does not fit. In these cases it might make sense to create events directly, by-passing the automatic creation and the resource event registry. There should not be too many such cases though, otherwise the utility of the generic method above is in question.
Given the above, it should be possible for a user with the proper permissions to:
resource.href
resource.type
TODO: clarify how notifications work in Nuvla.
As far as I can tell, there is currently no link between notifications and events.
It might make sense to have an optional event-href
property pointing to the an event to which a notification might refer.
create-event
Configure the registry in such a way that all current events are still logged, with all the information that is currently logged.
Events should be immutable, edit and delete should not be allowed (at least for standard user roles).
Events are retained in ES for 1 year, then they are deleted. The current retention period seems reasonable, but it would be good to re-evaluate once many more events are being generated, based on the load on ES.
Worth also considering: instead of deleting old events, we might want to zip them up and send them to S3 or similar cheap service for cold storage, and have a much longer retention period in that form.
With the proposal above, we can think of a couple of undesirable scenarios:
Ideally events and the actions they represent should be committed in an atomic transaction. Elastic Search, which is where resources are currently being stored, does not support atomic transactions. Also, some actions do not interact with Elastic Search, so making them atomic with the event creation would be challenging anyway.
Possible approaches to the above issue coming to mind:
Perform the action first, and only if the action succeeds, then try to commit the event. If commit of the event fails then try a few more times with exponential backoff; if that also fails then log a SEVERE error in the logs.
Pros:
Cons:
Commit the event first, and only if that succeeds then perform the action. If there is a failure while performing the action, we might think of deleting the event: but the event might have been consumed by other parties in the meantime, so that should not be considered as an option. An alternative could be to emit a compensating event, basically erasing the validity of the previous event.
Pros:
Cons:
Commit the event first and perform the action asynchronously, with a retry mechanism. Most interactions with the system are made via synchronous http calls, which are expected to return only once the action is completed. As such, this approach would require changes to CRUD operations and other actions to wait for the asynchronous result to be ready, before responding to the client.
Pros:
Cons:
I have not considered the option of issuing compensating actions in case of failures in storing events, because it seems unreasonable for example to start an app deployment only to cancel it a few ms later, only because it was not possible to log the corresponding event.
Of the options above, option 1. seems to me like the best compromise for the Nuvla system at the moment. Maybe a solution like 3., or a similar form of event driven architecture, could be considered in the future?
To keep a fully re-playble history of events as in event sourcing.
This is an interesting proposition to be able to trace events. But I would suggest to add it in a second phase.
Event config registry
I advise to make it a defmulti to get the config event for each resource. Because we load some resources dynamically from other projects jar. If the config of the resource is defined in each resource namespace, it will be loaded also dynamically at runtime. I think it’s better than having to configure resources that will maybe be loaded later. The other solution is to have function that each resource call at initialization that add event config to the atom registry. But I personally I prefer the defmulti.
Maybe move it under content ? Clarify how content is used in general.
what is the advantage of nesting instead keeping a flat document
Set acl in such a way that the event is visible for the user/group that created it.
User need at least view-data rights to be able to see the content of the event. (view-meta right give access only to name description common attributes) I'm thinking if it's better to give view-data on event to all users that are able to see the content of the resource that is triggering the event. (e.g. nuvlabox resource user/a action call on it should be also visible to all users that can see this nuvlabox
Migration of existing events
• how to migrate ? We can create a migration script from old schema to the new one. But we can do it at the very end. Do not bother with that at this step. • impacts on ui or other components ? who is making use of events? UI / notification system / api-server-private • is any customer making use of events already? Customers can see events in UI. I think that’s all usage that is being done from events by customers.
Note on consistency. Possible approaches to the above issue coming to mind:
payload: field to store arbitrary data for the event.
What is the spec definition for this field? Because ES doesn't play well with open maps.
Event config registry
I advise to make it a defmulti to get the config event for each resource. Because we load some resources dynamically from other projects jar. If the config of the resource is defined in each resource namespace, it will be loaded also dynamically at runtime. I think it’s better than having to configure resources that will maybe be loaded later. The other solution is to have function that each resource call at initialization that add event config to the atom registry. But I personally I prefer the defmulti.
Makes sense. Will change it to defmulti, thanks.
• impacts on ui or other components ? who is making use of events? UI / notification system / api-server-private
Ok, then given all these impacted systems, I guess it is better if the event spec is backward-compatible ? At least until all other components have migrated.
payload: field to store arbitrary data for the event.
What is the spec definition for this field? Because ES doesn't play well with open maps.
I took the spec of ::callback/data
as blueprint:
(s/def ::payload
(-> (st/spec (su/constrained-map keyword? any?))
(assoc :name "payload"
:json-schema/type "map"
:json-schema/description "event payload"
:json-schema/indexed false)))
The idea would be that this field is a dump of whatever might be useful, but should not be indexed by ES.
If we need additional info to be indexed they should be added to the spec.
For example in the prototype I moved the state
property inside payload: the idea would be that you search by event-type
instead. (but I will put it back for backward compatibility, see comment above).
I didn't see issues with it in my limited tests,do you foresee some problems with the spec above ?
payload: field to store arbitrary data for the event.
What is the spec definition for this field? Because ES doesn't play well with open maps.
I took the spec of
::callback/data
as blueprint:(s/def ::payload (-> (st/spec (su/constrained-map keyword? any?)) (assoc :name "payload" :json-schema/type "map" :json-schema/description "event payload" :json-schema/indexed false)))
The idea would be that this field is a dump of whatever might be useful, but should not be indexed by ES. If we need additional info to be indexed they should be added to the spec. For example in the prototype I moved the
state
property inside payload: the idea would be that you search byevent-type
instead. (but I will put it back for backward compatibility, see comment above).I didn't see issues with it in my limited tests,do you foresee some problems with the spec above ?
Interesting suggestion. We will have to check that ES doesn't count the keys in case we set the field as not indexed.
Maybe move it under content ? Clarify how content is used in general.
what is the advantage of nesting instead keeping a flat document
Ok, for some reason I thought that the keyword :content
was treated in a special way that I was not grasping :-)
I would have just 2 properties under content
, resource
and payload
, so I don't see much value in grouping them under :content
.
On the other hand if we need to add many additional fields (for ES indexing) then I can see the value of doing that.
ES doesn't count the keys
What do you mean ? Sorry,my understanding of how ES works is pretty basic..
Note on consistency. Possible approaches to the above issue coming to mind:
- Other suggestion • Commit the event first, if event creation fail log it but do not stop the original request. • If there is a failure while performing the action, we don’t delete the event, we create a new event that state action failure.
Ok, in this scenario I would also create an event when the action succeeds. Otherwise if you are listening to events live, you have no way of distinguishing if an action started but is still in progress vs an action started and succeeded.
payload: field to store arbitrary data for the event.
What is the spec definition for this field? Because ES doesn't play well with open maps.
I took the spec of
::callback/data
as blueprint:(s/def ::payload (-> (st/spec (su/constrained-map keyword? any?)) (assoc :name "payload" :json-schema/type "map" :json-schema/description "event payload" :json-schema/indexed false)))
The idea would be that this field is a dump of whatever might be useful, but should not be indexed by ES. If we need additional info to be indexed they should be added to the spec. For example in the prototype I moved the
state
property inside payload: the idea would be that you search byevent-type
instead. (but I will put it back for backward compatibility, see comment above). I didn't see issues with it in my limited tests,do you foresee some problems with the spec above ?Interesting suggestion. We will have to check that ES doesn't count the keys in case we set the field as not indexed.
I tested that we can fit a big number of keys in payload and seems to be working when not indexed in ES. 👏
ES doesn't count the keys
What do you mean ? Sorry,my understanding of how ES works is pretty basic..
I mean that ES limit indexed fields to 1000 keys per index by default.
Here is an example of an error if we index the attribute payload
and we register more that 1000 keys over the documents: "Limit of total fields [1000] in index [nuvla-event] has been exceeded"
But this is no more relevant, since when the payload attribute is not indexed ES doesn't account payload nested keys as part of that limit.
ES doesn't count the keys
What do you mean ? Sorry,my understanding of how ES works is pretty basic..
I mean that ES limit indexed fields to 1000 keys per index by default.
Here is an example of an error if we index the attribute
payload
and we register more that 1000 keys over the documents: "Limit of total fields [1000] in index [nuvla-event] has been exceeded"But this is no more relevant, since when the payload attribute is not indexed ES doesn't account payload nested keys as part of that limit.
Interesting, thanks.
After several code iterations I settled on some design decisions that I try to detail below.
For the implementation and latest changes, see the linked PR.
All events must have an event type:
An event type is a string identifier, and implicitly a specification for a set of event objects that have the same semantic intent and same structure.
Event types must belong to a category
, which is a way of grouping sets of event types with similar or related semantic.
They can also have a subcategory
.
In api-server, for every crud operation or action we define the following event types:
For example, the successful update of a nuvlabox
resource generates 2 events:
nuvlabox.update
nuvlabox.update.completed
If the operation above fails for some reason, it will generate the following events instead:
nuvlabox.update
nuvlabox.update.failed
Based on the event type (or its category), an event can contain additional details, which are grouped under the key details
.
For example, events of category state
must have a key new-state
nested under details
, and optionally a key old-state
.
In the example above the 2 events generated from the creation of a nuvlabox
resource have no apparent link,
other than the resource to which they refer to.
To make the link between the events explicit we use the property parent
.
In the examples above the events nuvlabox.update.completed
and nuvlabox.update.failed
would have the event nuvlabox.update
as their parent.
Some actions trigger other actions or operations internally, which in turn generate other events: this situation gives rise to a tree of events, where the root of the tree is the event representing the outermost request.
The acl on events is set with the following algorithm:
allow-anon
flag group/nuvla-admin
if active claim is not presentgroup/nuvla-admin
. (if we keep this approach,
we can later remove the allow-anon
flag, which is made redundant).group/nuvla-admin
if active claim is presentgroup/nuvla-admin
if active claim is not presentCRUD operations on events are redirected to an event manager.
The event manager has a generic interface and the actual implementation can be switched (I used db bindings and payments interface as blueprints for this).
This makes it possible to easily switch from an implementation to another.
For now there is only one main implementation (DbEventManager
) that stores the events in Elastic Search and sends them
to Kafka at the same time.
There is also a very simple implementation (NoEventsEventManager
) that does not store events.
The event manager also offers some wrapper functions for crud operations. The idea is that crud operations can be triggered from code with events enabled or without events enabled, depending on the situation.
New facade functions to invoke crud operations or actions with events generation enabled:
crud/add-with-events
crud/edit-with-events
crud/delete-with-events
crud/do-action-with-events
Events can also be enabled or disabled per resource type by redefining the multimethod sixsq.nuvla.events.config/events-enabled?
Also add per-event-type and per-category enabling/disabling of events? It can probably be useful, I would just wait for the first use-case before implementing it.
sixsq.nuvla.events.config/supported-event-types
: Return the event types supported by the given resource type, along with their category, and possibly subcategory and other flags (for now only allow-anon
).sixsq.nuvla.events.config/details-spec
: Returns the spec for the details field for the given event type.sixsq.nuvla.events.config/category-default-details-spec
: Returns the default spec for the details field for event types of the given category.After further thinking, I am still a bit reluctant about letting an operation proceed when there is an error storing an event. If later on we want to build other functionalities on top of the event system, we must be able to rely on the fact that every operation/action is accompanied by its corresponding events in the system.
I would do the following:
if there is a failure storing the initial event which traces the operation request, then we make the whole operation fail. In this case the client is expected to repeat the call.
if there is a failure in storing any subsequent events, then we log the failure but let the operation continue.
Such that:
every client call will be traced in the event logs, at least with the initial event
we can investigate issues out of band by looking for suspect events witch have an opening but not a closing (with either failure or success).
In any case, it is very important for the event creation code to be made simple and free of logical errors as much as possible: to be clear, it should only fail if the events are malformed, or if ES or the network etc. fails.
Since operations and actions on any resource will generate events by default, most of the lifecycle tests can be integrated with checks on the generated events.
I started doing this work for the module
, the deployment
and the nuvlabox0
resources, but it is a long way from being complete. I added a macro sixsq.nuvla.server.resources.lifecycle-test-utils/are-events
to make the
testing of events easier: it checks the event type, the resource, and the hierarchy of the events.
root
property: currently we are filling in the parent property which has the same meaning of a causation id in event sourcing. It would be good to also add a root
property, so that all events in a conversation can be retrieved in a single query (root, children, grand-children, etc. will all have the same value on the root property). Such a root
property would perform the same function as a correlation id in event sourcing.
I think it can be an optional property in the common properties of all resources: since parent
is defined there, the notion of root
should also make sense in general.
are-events
macro to keep a pointer to the last consumed event, and consume all remaining events when called, instead of using the count from the expectedReason for Pending: work on Projects is impeding the progress.
the events resource has a very simple schema and it is being underused.
Starting by the NuvlaBox, all relating actions (i.e. activate, commission, decommission, + custom actions , etc.) should generate an event, to be tracked in the NB detail page