Message Identifier Discussion

skilesare commented 5 months ago

The message identifier should be flexible enough so that for simple systems it can just be an increasing nonce, but for more complex systems we want deterministic ordering and the ability to know you missed a message.

How do we accomplish this?

Vector Clocks? Lamport Timestamps? Including previous complex nonce in each message?

{timestamp : Nat64} | {PublisherID: Nat64} | {(increasing nonce * broadcaster count)} //for partitioning

Gekctek commented 5 months ago

Id be curious on any and all scenarios that have requirements around this Personally I dont think my purposes need the id to have order built directly in them, but just some sort of ordering mechanism. Whether that would be a link to the 'previous' ones or just simply a timestamp that can order the events, probably just the timestamp.

But the other question is around the 'internal' vs 'external' ids. like an id for the message itself and an id for the pub/sub model. Each app could have its own id system that is determined by them, but the pub/sub model could use an stricter id system for handling of ordering events or being able to replay/find certain events

ava-vs commented 5 months ago

If we're talking about the minimal version so far, a separate id seems redundant. I think the timestamp is enough for most applications.

skilesare commented 5 months ago

We can make this pretty complex or pretty simple. First the complicated way. Take the following key attributes we're looking for and score them across a set of different ID systems.

Key Attributes:

Simplicity: Ease of implementation and understanding.
Ordering: Ability to maintain or determine the order of messages.
Detecting Missing Messages: Capability to identify gaps in the message sequence.
Scalability: How the system performs under increased load.
Performance: Impact on overall system performance, including processing and memory overhead.
Flexibility: Adaptability to various use cases and system complexities.
Universality: Applicability across different systems and environments.

Evaluation Matrix (Empty):

ID System	Simplicity	Ordering	Missing Messages	Scalability	Performance	Flexibility	Universality
Hybrid Timestamp-ID
UUID
Hash-based IDs
Hierarchical IDs
Lamport Timestamps with Publisher ID
Composite Key ID
Lexicographical Order IDs
Event Sourcing IDs
Blockchain-inspired IDs
Flattened Blob from Structured Data

We can go through each of these, score them, and see where we see the most highs.

The simple way:

Reduce it to two questions:

Are there any of the above that can't be represented by a fixed-width blob converted to a nat, and are we sure that most languages(Rust, Motoko, Azel, Java?, C#?) have an unbounded nat? Because if so it is really easy to just use a nat because it is useful in the smallest implementation (nonce), medium(timestamp), and distributed(as soon as you have to scale to more than one canister you'll be in a distributed setting...will happen much faster for any serious application on the IC than any of us may anticipate).

For example: the_quick_brown_fox can be represented as:

28690326576869079995668384064297231324925944108933159956634009002664434410779435694378716319761052827297390018201760082566703737107491700197379652288500795186180221022085578872

...which is a nat. Not a great nat, but a nat. Ours would be a bit more thoughtful. Additionally, an Nat can be converted into a blob and represented as a lexically orderable string.

Interesting hacker News article: https://news.ycombinator.com/item?id=18768909

Crockford Base32 looks like a good way to create a standard way of converting these to human-readable and shorten them up a bit: https://www.crockford.com/base32.html.

If a is true do we want a top-level variable pointing back to a previous ID or not? Is it too confusing if it is an opt nat:

ie

type event = record {
     event_id: nat;
     namespace: text;
     data: ICRC16;
     prev_id: opt nat?
     ....
};

For a distributed system, the prev id can be a complex encoded nat that includes the partition. The subscriber client can detect if it has missed a message from a particular partition by deciding the prev_id and making sure they have seen it, or else they wait or request. How this is interpreted may depend on if the event is a single-source or multi-source event.

Single Source - Through strict partitioning, a subscriber can universally know if it is missing messages by the single source using a nonce based on the number of partitions to order messages. Multi-Source - In multi-source systems, the receiving item will need to keep track of ordering in source partitions and may need to rely on a heartbeat or epoch message of some kind to close out proper ordering of messages.

All of this to say: Unless there is some magic ID in the list above, or that we haven't thought of I'd propose some language along the lines of below:

Event Ids are represented as Natural numbers with infinite precision. These numbers MAY be blob representations of more complicated numbering schemes converted to natural numbers. If the ID is encoded, it should be encoded as Crockford Base32 as specified in https://www.crockford.com/base32.html

Events MAY specify a previous id in prev_id. to indicate the immediately preceding message id known by the broadcasting system. Event systems SHOULD provide null in systems where event ordering is not important or where ordering is dependent on details internal to the ID. Event systems MAY interpret the previous id as a matter of implementation such that:

Single Publisher, Single Broadcaster systems should have a consistent chain of messages and no messages should be dropped.

Single Publisher, Multi Broadcaster systems should have a consistent chain of messaging according to nonce partitioning and no messages should be dropped.

Multi Publisher, Multi-Brodcaster systems should have consistent chains across publisher-based partitions such that either eacht partition stays consistent, or that all messages may be ordered provided there is an event-specific epoch close-out schema.

I need to think about this last one a bit more....but I think this is basically emulating how the IC works in knowing that everyone has agreed on the ordering of messages in a block.

Keep in mind that while this may seem complicated in specification language, we are creating something that very simple implementations can check themselves against. A simple private pub-sub that doesn't care about ordering and just increases a nonce can check itself against this and see that it doesn't violate the schema.

skilesare commented 5 months ago

Another thing that popped up...and along with some other comments is what to do with the difference between the EventID(assigned by the publisher) and the NotificationID(Assigned by the Broadcaster). Does the subscriber need to know this? Is it always just EventID + SubscriberID? Is is helpful in a debugging situation for the Subscriber to know it? If a downstream system is just looking at a reporting dashboard they may just have the event attributes and may be oblivious to canister ids...in which case, wouldn't they want to know their notification id? Perhaps this should be a separate Issue?

My gut is that it would be helpful and simplify error reporting, message confirmation, etc, and we can make it optional as it won't be very useful for Single publisher, single broadcaster, single subscriber implementations. It kind of sucks that you can't just not include the value (at least in motoko) and have it default to null. But I guess if were building libraries, the simple libraries can just not include it in the client types and devs won't really know it is there.

This would split the types between an EventType and a DeliveredEventType:

type Event = record {
     event_id: nat;
     namespace: text;
     data: ICRC16;
     prev_id: opt nat?
     ....
};

type DeliveredEvent = record {
     event_id: nat;
     delivery_id: nat;
     namespace: text;
     data: ICRC16;
     prev_id: opt nat?
     ....
};

ava-vs commented 4 months ago

Are there any of the above that can't be represented by a fixed-width blob converted to a nat, and are we sure that most languages(Rust, Motoko, Azel, Java?

Java 'int' has a very small size of 4 bytes. Java 'long' has 8 bytes and can't contain very long numbers (max is 9,223,372,036,854,775,807).

The pair <publisher's canisterId:timestamp> or (publisher's canisterId, timestamp) as eventId seems sufficient to uniquely identify a message even in a distributed system. It is simple, clear, and unambiguous. So my vote for Lamport timestamps with publisher ID.

A pair (eventId, eventFilter) could be used to handle specific subscription events, since subscribers use the same filters as publishers.

eventFilter could be a pair (text, blob) with a topic name and a filter within the topic or just text with candyPath.

The optional prev_id seems logical and allows chaining to be handled.

ava-vs commented 4 months ago

what to do with the difference between the EventID(assigned by the publisher) and the NotificationID(Assigned by the Broadcaster).

As I see it, the Broadcaster role is a middleware to send event messages to unknown (and untrusted) recipients (canisters). If so, the broadcaster's own id doesn't seem to be required, as the middleware only accepts messages with the event id from the publisher and forwards them to all subscribers via a one-way message. It will then send replies back to the publisher via a one-shot message with eventId and subscriberId.

skilesare commented 4 months ago

Vote passed

icdevs / ICEventsWG

Message Identifier Discussion #3

Key Attributes:

Evaluation Matrix (Empty):