openziti / fabric

Geo-scale overlay network and core network programming model
Apache License 2.0

Add entity change events #562

Closed plorenz closed 1 year ago

plorenz commented 1 year ago

Add events when entities are created/updated/deleted.

Need to figure out how we want to format events. Should it be based on the DB representation (easiest), the REST API representation (+ for consistency), or a custom events format (most control, but most work and requires more upkeep)?

plorenz commented 1 year ago

Notes from Russell:

First design decision I made dealt with granularity and purpose. The MOP's resource change messages are fired every time a property on the resource changes, regardless of whether that particular set of property changes is part of a sequence that composes a more logical change in the resource. For example, creation of a Service might result in two messages where the first is the initial creation message and the second is the update of the zitiId property after it is created in the Ziti Controller.

This level of granularity makes it easy to implement. Every time a change to the resource is persisted, a change message is fired. (Side note, the change messages are bound to the DB transaction. If the transaction is rolled back, the messages are not fired. No false messages.) The downside to this is that it shifts work to the message consumers; they have to be more selective when choosing what messages they should react to since there isn't a single logical "this meaningful thing just happened" message. In my experience, it is difficult and error prone to try and predict a higher level set of logical messages like that. Even in seemingly simple cases, it can quickly unravel as you discover a need for a message half way through what was thought to be a cohesive logical activity sequence, and such discoveries tend to lead to a slippery slope of overlapping, purpose specific messages ... and now you have a messaging anti-pattern, where the message is a substitute for a remote procedure call.

That being said, my design intent with MOP messaging does not exclude potentially creating a higher level, logical event set. For example, (arbitrary and made up) we could decide to define a set of messages that strictly define key points in a Customer's life cycle, like 'changing type, teams to growth' or 'payment status, good standing to in-arrears'. I think the key point is that this kind of logical message design and the aforementioned, generalized entity change message design should not be conflated.

Second design decision dealt with the message structure. The whole idea behind messaging is to decouple the sender from the observer, and that has implications, particularly with regard to the shape of the message. i.e., the sender can't emit XML when the observer is expecting JSON. And perhaps less obvious, the sender can't arbitrarily change the logical content or shape of the message and expect observers to cope with it.

In the MOP, resource change messages have a well defined envelope that all messages are expected to conform to. The envelope contains a set of properties (I'll go over them in a minute) which observers can use for much of their decision logic. Since these are part of a universally agreed envelope structure, this ensures compatibility across observers. At the same time, the envelope has support for an arbitrary payload; a 'from' and 'to' version of the resource that has changed. Neither the envelope nor the MOP in general makes any guarantee about the structure of those two properties. Thus, it is up to the observer to peer into them, deal with potential unexpected structures, etc. This keeps the coupling of logic as confined as possible. Only the class handling that message contains an expectation of structure, matching that class's purpose.

Third design aspect deals with message routing. While it's possible to just emit a stream of messages, it forces all observers to drink from the same fire-hose. We want / need the ability to declare new observers at will, and for them to be able to express, to some degree, what subset of messages are of interest to them. And it's especially nice if they can offload things like durability, retries, etc.

MOP uses RabbitMQ (an AMQP messaging broker, comparable to ActiveMQ) to meet these needs. All messages have a set of headers, just like an http request. The most important of these is the routing key (akin to an http request's url path). Rabbit (AMQP) uses a dotted string notation for routing keys, so MOP resource change messages all have a consistent key pattern, like this: ResourceChange.v1.{change-type}.{resource-type}. This gives observers the ability to express interest in a subset of messages by specifying a matching pattern against message keys of that pattern. If the observer wants all messages, not just resource changes, then they subscribe to #. If they want only resource change messages (version 1), they subscribe to ResourceChange.v1.*.*. In most cases, observers are interested in specific resource types like a Config, in which case they subscribe to ResourceChange.v1.*.Config or perhaps just when they are created, ResourceChange.v1.CREATED.Config.
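
To make the routing-key matching concrete, here is a minimal sketch in Python. It assumes AMQP topic-exchange wildcard semantics (`*` matches exactly one dot-separated word, `#` matches zero or more words). `topic_matches` is a hypothetical helper written for illustration; it is not RabbitMQ's implementation and not MOP code.

```python
def topic_matches(pattern, routing_key):
    """Return True if a dot-separated routing key matches a topic pattern.

    Assumed semantics (AMQP topic exchange): '*' matches exactly one word,
    '#' matches zero or more words.
    """
    def match(p, k):
        if not p:
            # Pattern exhausted: it matches only if the key is exhausted too.
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb zero or more words of the key.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


# Hypothetical subscriptions checked against one example key.
key = "ResourceChange.v1.CREATED.Config"
for pattern in ("#", "ResourceChange.v1.*.*", "ResourceChange.v1.*.Config",
                "ResourceChange.v1.DELETED.Config"):
    print(pattern, topic_matches(pattern, key))
```

An observer that only cares about Config creation would bind with the fully specified key, while a general audit consumer would bind with `#`.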

Rabbit provides a lot of additional functionality for MOP (not suggesting Ziti use it; just elaborating things to consider), including the ability for an observer to declare if messages matching its interest should be queued in the broker when the observer is offline or just dropped, any TTL on handling those messages, what to do if they time out or fail, and of course support for messaging topologies. After listing that out though ... I'm presuming ziti will 'dump' these messages to one of a small set of supported destinations, even if it's a local journaling file, after which it's up to something else (like a 'beat' from Elastic) to capture and pump the message into whatever system is expecting it. i.e., all the things I listed in the first sentence are not problems I'd expect ziti to solve.

Fourth and final design topic, and prelude to listing the universal properties in a change message envelope. Those envelope properties (and the routing key parts) contain values which have to mean something to the observer. For example, all MOP resource change messages have a {change-type} in the key and the value can be CREATED, UPDATED, or DELETED. Unless we change the version number, there will never be a DEACTIVATED, REGISTERED, or anything else. If there were, then the limited universal expectation between message sender and the observer would be violated.

In the MOP, I was able to select emerging 'standard' ways of expressing various property values. Fortunately those existed in some form or another, so I was never in the position of making up something new and specific to the ontology of change messages.

So, without further ado, here are the envelope properties of a MOP resource change message:

id - A UUID, generated for the message. Often ignored, but practically required if we ever need to ensure there is no duplication or otherwise want to track a specific message.

changeType - One of CREATED, UPDATED, or DELETED. Fairly straightforward. Being an envelope property, this makes it easy for observers to quickly sort the message based on their need.

"Contextual Properties" - Within what execution context did this change occur?

  • correlationId - Placeholder for a potential client provided "correlation" id value, so that we can trace all things related to that client specified request.

  • traceId - For all MOP executions (every time a thread starts business logic), a "trace" is created and its id is placed into all logging, messaging, and even API calls to other services. This lets us filter out all the noise and see only the logs, or messages in this case, which are related to a particular thread of execution.

  • spanId - Related to a trace. Whenever a logical thread of execution jumps from one service to another (à la REST request), the trace id stays the same but a new span id is created. So, a span id is attached to all logs, messages, etc. within the scope of a single service (technically we can define our own span scopes), while the trace id will link all of the spans across services.

General comment: These values come from our use of emerging standards for distributed tracing. See https://opentelemetry.io/ if you want to dive into the deep end (I see lots of go stuff!) Or, simply look for a way to "instrument" ziti components here: https://opentelemetry.io/docs/instrumentation/go/ and start by figuring out how to get the trace and span ids into logs and events. After that, for actually distributed execution tracing, you'll want to look at "propagators": https://github.com/open-telemetry/opentelemetry-go-contrib/tree/main/propagators which take care of passing trace ids between components (among other info). There are emerging standards for this as well. MOP uses a format referred to as "B3" https://github.com/openzipkin/b3-propagation#single-header but we're migrating to a W3C format https://www.w3.org/TR/trace-context/

  • initiatingIdentityId - Who triggered the execution which caused the change that this message is about? All things in MOP can be attributed to a MOP identity id, even if it is the MOP service itself; MOP services have a "Service Identity" which is used if they cause a resource change as part of some scheduled process (for example.)

"Resource Identity" - There is a set of properties that all answer this question, and each targets a slightly different purpose. As you read these, remember the design goal that these property values have to be understandable to all observers; they have to be expressed in a cross-service, enterprise language.

  • resourceDomain - At the MOP Enterprise level, all services and all of the resource types defined and managed by a service are organized into 5 top line "domains": management, identity, networking, billing, and auth. We may end up with more, but that's it for the moment. Some of these domains have several micro-services within them, and some have one. Each micro-service defines and manages its own tree of resources, and all resource change messages include the domain within which the message's resource is defined.

  • resourceType - Like domain, all resources have a declared string, technically called the resource type "code". This is that value. Examples are network-controller, service, account, etc.

  • resourcePath - This property contains a URN like value which starts with the resource's domain, and then contains a list of resource type and id pairs. The list is in order from the root of the resource tree all the way down to the resource type and id that this change message is about. This "resource path" structure is what is used by the MOP to perform authorization. MOP's auth is designed around granting permission on resources that subset a given path; for example, permission to update any service resource as long as it is under network-group:333:network:111. So ... from a resource change message perspective, having this property (and trusting its authenticity) lets MOP apply authorization constraints to these messages if/when we allow them to be consumed by MOP clients.

  • resourceId - The literal id, UUID in MOP's case, of the resource.

"Envelope Content" - These are resource change messages, so the content is a resource. Since this is a change message, the envelope supports the former and new version, and this is where the polymorphism starts. In MOP code, these two properties are arbitrary JSON blobs. This ensures there are no class dependencies introduced across message senders and receivers, and if the receiver wants to unmarshal one of these properties into a locally defined type, then the onus of dealing with mismatches is on that code.

  • fromVersion - Possibly empty (on create for example), arbitrary JSON version of the resource before the change.

  • toVersion - Possibly empty (true deletion), arbitrary JSON version of the resource after the change.

The in-code representation of the resource change message envelope provides support for the common task of comparing deep properties of the from and to version (at the same path address.) This is super handy for observers that get all change messages and just want to know if the property they are interested in has changed.

occuredAt - Last but not least, a timestamp set by the service sending the message. No guarantee that it is a match for the resource's updated at time, or that all messages will be in order, etc. But otherwise a fundamental, "when did it occur?" time stamp.
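
The "compare a deep property of the from and to version at the same path" helper described above could look roughly like the sketch below. This is not MOP's code; `get_path` and `property_changed` are hypothetical names, and the dot-separated path syntax is an assumption made for illustration.

```python
# Sentinel used to tell "property absent" apart from "property is null".
_MISSING = object()


def get_path(doc, path):
    """Walk a dot-separated path through nested dicts.

    Returns _MISSING if any step of the path is absent, or if `doc` is
    None (as it would be for fromVersion on create or toVersion on delete).
    """
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def property_changed(from_version, to_version, path):
    """True if the value at `path` differs between the two versions."""
    return get_path(from_version, path) != get_path(to_version, path)


# Hypothetical payloads; the envelope makes no promise about their shape.
before = {"name": "svc", "config": {"port": 80}}
after = {"name": "svc", "config": {"port": 8080}}
print(property_changed(before, after, "config.port"))
print(property_changed(before, after, "name"))
```

Because the observer owns the path it asks about, any expectation about payload structure stays inside that observer, which matches the decoupling goal described above.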

plorenz commented 1 year ago

Here is the first pass at output from entity change events. Some things that I already know need to be changed/added:

  1. A UUID will be added (for the event commits)
  2. Event commits will be added (to let you know that the event is valid)
  3. I want to add edge.schema.version in as well
  4. Currently the json is generated directly from the DB types. We'll probably need a custom serializer per type to make the output prettier.

Other notes:

  1. traceId is supported in the metadata. If it's provided in a request it will be passed through. Still need to test if this actually works :)
  2. There's an is_parent_event flag. This will be true if there's a parent/child relationship and the event is for the parent. This is for things like services and routers. So the fabric service and edge service will both get an event. I will likely add a flag to allow ignoring parent events since the child event will generally have all the data needed.
  3. Each event has initial_state and final_state. For creates initial_state will be null. For updates, neither should be null and for deletes final_state will be null.
{
  "namespace": "entityChange",
  "event_type": "created",
  "timestamp": "2023-04-18T14:22:22.73848403-04:00",
  "metadata": {
    "authorId": "8uCg5liKM",
    "authorName": "Default Admin",
    "fabric.schema.version": 5,
    "source": "rest[auth=edge/host=localhost:1280/method=POST/remote=127.0.0.1:41610]"
  },
  "entity_type": "services",
  "is_parent_event": false,
  "initial_state": null,
  "final_state": {
    "Id": "1Qmei9vEW5QcNFUxjOp3f1",
    "CreatedAt": "2023-04-18T18:22:22.660022897Z",
    "UpdatedAt": "2023-04-18T18:22:22.660022897Z",
    "Tags": {},
    "IsSystem": false,
    "Migrate": false,
    "Name": "events-example",
    "TerminatorStrategy": "smartrouting",
    "RoleAttributes": null,
    "Configs": null,
    "EncryptionRequired": true
  }
}

{
  "namespace": "entityChange",
  "event_type": "updated",
  "timestamp": "2023-04-18T14:23:49.446276013-04:00",
  "metadata": {
    "authorId": "8uCg5liKM",
    "authorName": "Default Admin",
    "fabric.schema.version": 5,
    "source": "rest[auth=edge/host=localhost:1280/method=PATCH/remote=127.0.0.1:44408]"
  },
  "entity_type": "services",
  "is_parent_event": false,
  "initial_state": {
    "Id": "1Qmei9vEW5QcNFUxjOp3f1",
    "CreatedAt": "2023-04-18T18:22:22.660022897Z",
    "UpdatedAt": "2023-04-18T18:22:22.660022897Z",
    "Tags": {},
    "IsSystem": false,
    "Migrate": false,
    "Name": "events-example",
    "TerminatorStrategy": "smartrouting",
    "RoleAttributes": null,
    "Configs": null,
    "EncryptionRequired": true
  },
  "final_state": {
    "Id": "1Qmei9vEW5QcNFUxjOp3f1",
    "CreatedAt": "2023-04-18T18:22:22.660022897Z",
    "UpdatedAt": "2023-04-18T18:23:49.374979246Z",
    "Tags": {},
    "IsSystem": false,
    "Migrate": false,
    "Name": "events-example",
    "TerminatorStrategy": "smartrouting",
    "RoleAttributes": [
      "one",
      "two"
    ],
    "Configs": null,
    "EncryptionRequired": true
  }
}

{
  "namespace": "entityChange",
  "event_type": "deleted",
  "timestamp": "2023-04-18T14:25:38.427682958-04:00",
  "metadata": {
    "authorId": "8uCg5liKM",
    "authorName": "Default Admin",
    "fabric.schema.version": 5,
    "source": "rest[auth=edge/host=localhost:1280/method=DELETE/remote=127.0.0.1:39138]"
  },
  "entity_type": "services",
  "is_parent_event": false,
  "initial_state": {
    "Id": "1Qmei9vEW5QcNFUxjOp3f1",
    "CreatedAt": "2023-04-18T18:22:22.660022897Z",
    "UpdatedAt": "2023-04-18T18:23:49.374979246Z",
    "Tags": {},
    "IsSystem": false,
    "Migrate": false,
    "Name": "events-example",
    "TerminatorStrategy": "smartrouting",
    "RoleAttributes": [
      "one",
      "two"
    ],
    "Configs": null,
    "EncryptionRequired": true
  },
  "final_state": null
}
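
A consumer can sanity-check the initial_state/final_state rule from the notes above (null initial state on create, both set on update, null final state on delete). The sketch below uses the field names from this first-pass output; `check_states` is a hypothetical helper, not part of the fabric code base.

```python
def check_states(event):
    """Return a list of problems with initial_state/final_state for an event.

    Expected presence per event_type, as (initial_state, final_state):
    created -> (null, set), updated -> (set, set), deleted -> (set, null).
    """
    expected = {
        "created": (False, True),
        "updated": (True, True),
        "deleted": (True, False),
    }
    event_type = event.get("event_type")
    if event_type not in expected:
        return ["unexpected event_type: %r" % (event_type,)]

    want_initial, want_final = expected[event_type]
    problems = []
    for field, wanted in (("initial_state", want_initial),
                          ("final_state", want_final)):
        present = event.get(field) is not None
        if present != wanted:
            state = "set" if wanted else "null"
            problems.append("%s should be %s for %s" % (field, state, event_type))
    return problems


# A trimmed version of the 'created' sample above.
sample = {"event_type": "created", "initial_state": None,
          "final_state": {"Id": "1Qmei9vEW5QcNFUxjOp3f1"}}
print(check_states(sample))
```
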
dovholuknf commented 1 year ago
plorenz commented 1 year ago
  • my biggest gripe is that detecting the delta is put on the consumer of this event. I can understand why you chose to do it this way, but it seems redundant to supply all the same data and make the other end process it

    • does a delete event really need its initial_state? seems like it's not useful/relevant at that point

    • IDs/IsSystem don't ever change right? does it make sense to send that sort of information in these payloads? It feels like it doesn't belong to me but other immutable data (if there is any) should at most be there one time imo. certainly not twice like is shown in the change example

    • authorId/authorName seems redundant to me

    • source is awfully verbose. i don't quite understand who the expected consumer for that would be

For the first two, I think the question is whether events need to be useful on their own, or if it's ok to reconstitute the full state based on a full audit trail. Answering that is something that maybe ops folks would be best suited for. I also would tend to prefer a diff style and no state on the delete.

If we do deltas then IDs and anything immutable would get dropped. If not, we might want to strip them out for updates.

authorId/authorName is a bit redundant. authorId is required, authorName is for convenience. Can't have just authorName, because that can change over time.

source is for audit. If consensus is that it's too verbose I don't mind trimming it down.

dovholuknf commented 1 year ago

In retrospect, I like the name of the actor initiating the change. that way if the id is removed, you still know 'who' it was. i retract that comment :)

plorenz commented 1 year ago

Here's an updated version:

Changes:

{
  "namespace": "entityChange",
  "event_id": "a66478ff-f823-4f44-84d9-c3105a65923d",
  "event_type": "created",
  "timestamp": "2023-04-19T10:48:22.13608773-04:00",
  "metadata": {
    "authorId": "8uCg5liKM",
    "authorName": "Default Admin",
    "source": "rest[auth=edge/host=localhost:1280/method=POST/remote=127.0.0.1:35008]",
    "version": "v0.0.0"
  },
  "entity_type": "services",
  "is_parent_event": false,
  "initial_state": null,
  "final_state": {
    "Id": "4LCezrTjm4jYjf5JV8bJbY",
    "CreatedAt": "2023-04-19T14:48:22.04739518Z",
    "UpdatedAt": "2023-04-19T14:48:22.04739518Z",
    "Tags": {},
    "IsSystem": false,
    "Migrate": false,
    "Name": "test",
    "TerminatorStrategy": "smartrouting",
    "RoleAttributes": null,
    "Configs": null,
    "EncryptionRequired": true
  }
}
{
  "namespace": "entityChange",
  "event_id": "a66478ff-f823-4f44-84d9-c3105a65923d",
  "event_type": "committed",
  "timestamp": "2023-04-19T10:48:22.152906329-04:00"
}
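
One way a consumer might use the commit events: hold each change event until a "committed" event with the same event_id arrives, and only then act on it. The sketch below uses the field names from the sample directly above. `CommitGate` is a hypothetical class, and allowing several change events to share one event_id is an assumption, not something confirmed in this thread.

```python
class CommitGate:
    """Hold entity change events until their 'committed' event arrives."""

    def __init__(self):
        # event_id -> list of change events waiting for a commit
        self._pending = {}

    def feed(self, event):
        """Accept one decoded event; return the change events it releases."""
        event_id = event.get("event_id")
        if event.get("event_type") == "committed":
            # Release whatever was waiting on this id (possibly nothing).
            return self._pending.pop(event_id, [])
        self._pending.setdefault(event_id, []).append(event)
        return []


gate = CommitGate()
created = {"namespace": "entityChange",
           "event_id": "a66478ff-f823-4f44-84d9-c3105a65923d",
           "event_type": "created"}
committed = {"namespace": "entityChange",
             "event_id": "a66478ff-f823-4f44-84d9-c3105a65923d",
             "event_type": "committed"}
print(gate.feed(created))    # nothing released yet
print(gate.feed(committed))  # releases the held 'created' event
```

A real consumer would also need a policy for change events whose commit never shows up, for example expiring them after a timeout; that is left out here.
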
Russell-Allen commented 1 year ago

The main event schema is pretty close to perfect IMO.

I presume all properties except initial_state and final_state are what I refer to as "envelope" properties; they can be relied upon to exist and contain a consistent value format. Thus observers can reliably read these properties prior to reading the more polymorphic initial and final state properties.

If these "envelope" properties were to ever change (additive or breaking) ... would a client be able to use the metadata.version alongside ziti release notes to "adjust their expectations"?

Nit-Picks:

dromedaryCamelCase is my personal preference. :)


Diff vs Full State ...

~I prefer~ a full state is effectively required. A diff only event presumes that a consumer's interest in the event is isolated and determinable just by the property which changed. Consider a diff only event that indicates that #foo was added to an Identity's roleAttributes. Without the inclusion of the Identity's disabled property, a consumer would not be able to self-determine that it can ignore the event. A diff only view would force non-trivial consumers into making an API call to read the full state of the entity, which couples the event volume to controller API load. Additionally, a diff only consumer is not always able to get a Consistent (as in the C in ACID) view of the entity ... at least, not without maintaining a stack of diffs (in order and without gaps.)

While a full state event is more verbose and often redundant, the consumer can rely on the initial and final state to be internally Consistent. That subtle benefit, and having all the data which might be needed, outweighs the cost of additional bytes (which is very low as is.)
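
To illustrate the roleAttributes/disabled example: with full state in the event, the consumer can decide locally whether to react, with no call back to the controller. This is only a sketch. The state keys `roleAttributes` and `disabled` are taken from the example in this comment and are assumptions about the serialized identity, and `should_react` is a hypothetical consumer function.

```python
def should_react(event):
    """Decide whether a role-attribute change on an identity needs action.

    Uses only the full initial/final state carried by the event.
    """
    if event.get("event_type") != "updated":
        return False
    before = event.get("initial_state") or {}
    after = event.get("final_state") or {}
    if before.get("roleAttributes") == after.get("roleAttributes"):
        # The property this consumer cares about did not change.
        return False
    # An unrelated property from the same consistent snapshot lets the
    # consumer ignore changes to disabled identities.
    return not after.get("disabled", False)


event = {
    "event_type": "updated",
    "initial_state": {"roleAttributes": [], "disabled": True},
    "final_state": {"roleAttributes": ["foo"], "disabled": True},
}
print(should_react(event))
```

With a diff only event, the `disabled` check would not be possible without fetching the identity from the API.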


authorId & authorName - I have little concern for the extra bytes of data in this plane. Having the name also means one less thing to have to hit the Ziti API to get, and it prevents the utility of prefacing a malicious change with a change-name-to-blame request. :)


source - Don't remove it, don't trim it down, but consider breaking it up? As a packed string, it forces the consumer to parse the string value and that parsing logic is tightly coupled ... expresses an expectation that will be very fragile or Ziti will be stuck with. So ... something like(?):

"source": {
  "scheme": "rest",
  "auth": "edge",
  "host": "localhost",
  "port": "1280",
  "method": "POST",
  "remote": "127.0.0.1:35008"
}
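
For comparison, this is roughly what every consumer would have to write to read the packed string form. It is a sketch that assumes the `type[key=value/key=value]` layout seen in the samples in this thread; `parse_packed_source` is a hypothetical helper.

```python
import re

# Assumed layout: <type>[<key>=<value>/<key>=<value>/...]
_PACKED = re.compile(r"^(?P<type>\w+)\[(?P<body>.*)\]$")


def parse_packed_source(source):
    """Split a packed source string into a dict of its parts."""
    match = _PACKED.match(source)
    if match is None:
        raise ValueError("unrecognized source format: %r" % (source,))
    fields = {"type": match.group("type")}
    # Fragile by construction: a value containing '/' or a change in the
    # separator would silently break this parser.
    for pair in match.group("body").split("/"):
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields


print(parse_packed_source(
    "rest[auth=edge/host=localhost:1280/method=POST/remote=127.0.0.1:35008]"))
```

With a structured source object, none of this parsing (or its failure modes) lands on the consumer.
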
plorenz commented 1 year ago

Hi @Russell-Allen,

Some of the metadata properties will not always be present. There are three different scenarios:

  1. REST Request - examples above
  2. Control Channel changes on behalf of ERT - source will be different. It will have authorId/name of the router identity.
  3. Controller initiated changes. I think these are mostly on api sessions/sessions, so they'll go away when those entities are no longer in the DB. No author.

I did add an authorType, which should always be present. That way you can check if, based on the author type, the other fields should be present. Maybe should make author an entity?

"metadata" : {
  "author" : {
    "type" : "identity",
    "id" : "foo",
    "name" : "bar"
  }
}

We could fill in the authorId/Name with something for non-identity authors.

I'll fix the casing inconsistency. It'll probably be snake_case, since I think the rest of the events are standardized on that.

I'll see about splitting up source as well.
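
On the consumer side, the "check the author type first, then read the optional fields" idea might look like the sketch below. Only the "identity" type value appears in this thread; the "controller" value used in the example, the "unknown" fallback, and the `describe_author` function are all hypothetical.

```python
def describe_author(metadata):
    """Build a short audit label from the author entity in event metadata.

    Only `type` is assumed to always be present; `id` and `name` are
    treated as optional, as they may be absent for non-identity authors.
    """
    author = metadata.get("author") or {}
    author_type = author.get("type", "unknown")
    author_id = author.get("id")
    name = author.get("name")
    if author_id and name:
        return "%s:%s (%s)" % (author_type, name, author_id)
    if author_id:
        return "%s:%s" % (author_type, author_id)
    return author_type


print(describe_author({"author": {"type": "identity", "id": "foo", "name": "bar"}}))
# Hypothetical author type with no id or name attached.
print(describe_author({"author": {"type": "controller"}}))
```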

Russell-Allen commented 1 year ago

All good.

I have no worries about 'envelope' properties having some differences based on the type of event, presuming...

That way you can check if, based on the author type, the other fields should be present.

:+1:

I like the entity version of Author.

We could fill in the authorId/Name with something for non-identity authors.

If it makes sense, great, and if it doesn't ... if there's not a reasonable "name" like value (as an example), then omit that property from the entity. (see bullet list above about "some differences" in schema)

I'll fix the casing inconsistency ... probably snake_case ...

:+1: That works. Consistency is far more important than a style preference.

plorenz commented 1 year ago

Here's the current output. Let me know if anyone sees anything of concern.

{
  "namespace": "entityChange",
  "event_id": "20a9ba5e-1f6c-4347-981c-09ee6d673c70",
  "event_type": "deleted",
  "timestamp": "2023-04-21T17:32:12.63708354-04:00",
  "metadata": {
    "author": {
      "type": "identity",
      "id": "8uCg5liKM",
      "name": "Default Admin"
    },
    "source": {
      "type": "rest",
      "auth": "edge",
      "local_addr": "localhost:1280",
      "remote_addr": "127.0.0.1:43206",
      "method": "DELETE"
    },
    "version": "v0.0.0"
  },
  "entity_type": "services",
  "is_parent_event": false,
  "initial_state": {
    "id": "4BN2OhyZrUipDqLWP8jWvP",
    "created_at": "2023-04-21T21:30:11.125151988Z",
    "updated_at": "2023-04-21T21:30:11.125151988Z",
    "tags": {},
    "is_system": false,
    "name": "echo",
    "terminator_strategy": "smartrouting",
    "role_attributes": [
      "echo"
    ],
    "configs": null,
    "encryption_required": true
  },
  "final_state": null
}
{
  "namespace": "entityChange",
  "event_id": "20a9ba5e-1f6c-4347-981c-09ee6d673c70",
  "event_type": "committed",
  "timestamp": "2023-04-21T17:32:12.639025872-04:00"
}
plorenz commented 1 year ago

Final format:

{
  "namespace": "entityChange",
  "eventId": "326faf6c-8123-42ae-9ed8-6fd9560eb567",
  "eventType": "created",
  "timestamp": "2023-05-11T21:41:47.128588927-04:00",
  "metadata": {
    "author": {
      "type": "identity",
      "id": "ji2Rt8KJ4",
      "name": "Default Admin"
    },
    "source": {
      "type": "rest",
      "auth": "edge",
      "localAddr": "localhost:1280",
      "remoteAddr": "127.0.0.1:37578",
      "method": "POST"
    },
    "version": "v0.0.0"
  },
  "entityType": "services",
  "isParentEvent": false,
  "initialState": null,
  "finalState": {
    "id": "6S0bCGWb6yrAutXwSQaLiv",
    "createdAt": "2023-05-12T01:41:47.128138887Z",
    "updatedAt": "2023-05-12T01:41:47.128138887Z",
    "tags": {},
    "isSystem": false,
    "name": "test",
    "terminatorStrategy": "smartrouting",
    "roleAttributes": [
      "goodbye",
      "hello"
    ],
    "configs": null,
    "encryptionRequired": true
  }
}
{
  "namespace": "entityChange",
  "eventId": "326faf6c-8123-42ae-9ed8-6fd9560eb567",
  "eventType": "committed",
  "timestamp": "2023-05-11T21:41:47.129235443-04:00"
}