bilderbuchi commented 1 year ago

To keep track of our messages, we should have a unique identifier per message, something like a UUID. UUIDs can contain timestamps, too, which might come in handy: we could carry the timestamp in the same field as the ID, saving space in the protocol.

UUID

The IETF has identified a couple of useful criteria for UUIDs to have:

An inspection of these implementations details the following trends that help define this standard:

Timestamps MUST be k-sortable. That is, values within or close to the same timestamp are ordered properly by sorting algorithms.

Timestamps SHOULD be big-endian with the most-significant bits of the time embedded as-is without reordering.

Timestamps SHOULD utilize millisecond precision and Unix Epoch as timestamp source. Although, there is some variation to this among implementations depending on the application requirements.

The ID format SHOULD be Lexicographically sortable while in the textual representation.

IDs MUST ensure proper embedded sequencing to facilitate sorting when multiple UUIDs are created during a given timestamp.

IDs MUST NOT require unique network identifiers as part of achieving uniqueness.

Distributed nodes MUST be able to create collision resistant Unique IDs without consulting a centralized resource.

Most of these sound useful for us, too. For example, we could sort a database/collection of messages (maybe from different nodes) by the UUID, and they would automatically be arranged by time. Also, we could parse the timestamp out of the UUID easily (afaict).
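To illustrate the sorting and parsing idea, here is a minimal sketch using only the Python standard library. It assumes the version-7 layout from the draft (48-bit millisecond Unix timestamp in the most significant bits, then version, variant and random bits). The names `make_uuid7` and `timestamp_ms` are made up for this example, and the sketch leaves out the sequencing of IDs created within the same millisecond.

```python
import os
import time
import uuid


def make_uuid7(unix_ms=None):
    """Build a UUID with the version-7 layout: 48-bit ms timestamp, then random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (unix_ms & ((1 << 48) - 1)) << 80 | rand
    # Overwrite the version nibble (bits 76..79) with 7.
    value = (value & ~(0xF << 76)) | (0x7 << 76)
    # Overwrite the variant (bits 62..63) with binary 10.
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    return uuid.UUID(int=value)


def timestamp_ms(u):
    """Read the millisecond Unix timestamp back out of the top 48 bits."""
    return u.int >> 80


# IDs from three different milliseconds, deliberately out of order.
ids = [make_uuid7(3000), make_uuid7(1000), make_uuid7(2000)]
by_value = sorted(ids)                   # sorts by the 128-bit integer
by_text = sorted(str(u) for u in ids)    # sorts by the hex text form
print([timestamp_ms(u) for u in by_value])
```

Both the binary and the textual sort end up in timestamp order, because the timestamp sits in the leading bits and the hex representation has a fixed width.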

I have reviewed the currently available UUID versions, and they don't fit that need so well. Versions 6, 7 and 8 from the IETF draft linked above sound useful, but alas, it is still in draft state, so we probably won't see wide adoption soon.

UUID7

Apparently, implementations of this are available, e.g. https://pypi.org/project/uuid7/ -- might be worth it to investigate if we should go with the thing that should become a standard. Maybe v6 or v8, too?

ULID

Another concept is the ULID (Universally Unique Lexicographically Sortable Identifier). It has a 48-bit timestamp with millisecond resolution, which should be enough for our purposes, followed by 80 bits of randomness (2^80, about 1.2e24 possible values for every millisecond). The random part might even be reduced for our purposes.

Crucially, implementations are available in many languages! The encoding also seems much more readable (Crockford's Base32 alphabet instead of hex) -- UUID: a9957082-0b47-11ed-8a91-3cf011fe32f1, ULID: 01ARZ3NDEKTSV4RRFFQ69G5FAV
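For comparison, a rough sketch of how a ULID could be built and decoded by hand, assuming the layout described above (48-bit millisecond timestamp, 80 random bits, 26 characters of Crockford's Base32). `make_ulid` and `ulid_timestamp_ms` are hypothetical helper names; a real project would more likely use one of the existing libraries.

```python
import os
import time

# Crockford's Base32 alphabet: digits and upper-case letters without I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def make_ulid(unix_ms=None):
    """Encode a 48-bit ms timestamp plus 80 random bits as 26 Base32 characters."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")
    value = (unix_ms & ((1 << 48) - 1)) << 80 | rand
    # 26 characters of 5 bits each cover 130 bits; the top 2 bits are always zero.
    return "".join(CROCKFORD[(value >> shift) & 0x1F] for shift in range(125, -1, -5))


def ulid_timestamp_ms(ulid):
    """Decode the first 10 characters, which hold the timestamp."""
    value = 0
    for ch in ulid[:10]:
        value = value * 32 + CROCKFORD.index(ch)
    return value


example = make_ulid(1469918176385)
print(example, ulid_timestamp_ms(example))
```

Because the alphabet is in ascending ASCII order and the length is fixed, sorting the strings also sorts by timestamp.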

Customized format

We could use another format where we give up some entropy from the random part and use those bits to encode human-meaningful data instead, say a 3-byte message type or some such.
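One possible shape for such a format, purely as an illustration: a 16-byte ID made of a 6-byte millisecond timestamp, a 3-byte message type and 7 random bytes. This byte layout, the helper names and the example type `b"CMD"` are all assumptions for the sketch, not something that has been decided here.

```python
import os
import time


def make_custom_id(msg_type, unix_ms=None):
    """Return 16 bytes: 6-byte ms timestamp + 3-byte message type + 7 random bytes."""
    if len(msg_type) != 3:
        raise ValueError("message type must be exactly 3 bytes")
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    return unix_ms.to_bytes(6, "big") + msg_type + os.urandom(7)


def parse_custom_id(raw):
    """Split an ID back into (timestamp in ms, message type)."""
    return int.from_bytes(raw[:6], "big"), raw[6:9]


raw_id = make_custom_id(b"CMD", 1234)
print(raw_id.hex(), parse_custom_id(raw_id))
```

With this layout, IDs still sort by time first, but within one millisecond they sort by message type before the random part, and only 56 random bits remain to avoid collisions.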

BenediktBurger commented 1 year ago

Unique IDs are good, but I would also like to have a way to match a response to the original request. We can put that information in the header (transport level) or in the content of a message.

A freely settable "subject" (maybe in addition to the ID/timestamp) has the benefit that you can filter the answers more easily, as you do not have to remember the message ID with which you requested the information. Example: you request something regularly and give it the subject "Request5". Whenever you receive an answer with that subject, you know how to handle it. Without the subject, you would have to store the ID of your original message, look up what to do when the answer arrives, and then delete that entry from the list.
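The difference between the two bookkeeping styles could look roughly like this. It is only a sketch: the function names, the `pending` dictionary and the handler are invented for the example, and no transport is involved.

```python
def handle_temperature(value):
    """Hypothetical handler for one kind of answer."""
    return f"temperature: {value}"


# Variant 1: match by message ID. Every request leaves an entry that must be
# stored, looked up when the answer arrives, and deleted afterwards.
pending = {}  # message ID of the request -> handler for its answer


def send_request(msg_id, handler):
    pending[msg_id] = handler


def on_response_by_id(reply_to, payload):
    handler = pending.pop(reply_to)  # look up and delete in one step
    return handler(payload)


# Variant 2: match by subject. The table is set up once and never changes,
# no matter how often the same request is repeated.
subject_handlers = {"Request5": handle_temperature}


def on_response_by_subject(subject, payload):
    return subject_handlers[subject](payload)


send_request("id-0001", handle_temperature)
print(on_response_by_id("id-0001", 21.5))
print(on_response_by_subject("Request5", 21.5))
```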

BenediktBurger commented 1 year ago

I'd try to keep the ID as short as possible to reduce traffic (maybe it does not matter anyway).

Another point to consider: each computer might have a slightly different clock, so the timestamps of messages from different nodes won't match exactly. I guess it won't be a problem, I just wanted to mention it.

bilderbuchi commented 1 year ago

I'd try to keep the ID as short as possible to reduce traffic (maybe it does not matter anyway).

I think we should measure/try that before deciding either way. I agree, if the format is variable, the random part can be tailored to what we expect. However, if a format is widely known/standardised/available via multiple implementations, that might trump saving a couple of bytes per message.

I would also like to have a way to match a response to the original request. We can put that information in the header (transport level) or in the content of a message.

I'd put that into the header (as it's "routing info", not the payload/content per se). I was thinking of a reply-reference field that could indicate the message this is a reply to. However, imo this is orthogonal to message identifiers and we should track that in a separate issue.
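A minimal sketch of what such a header could look like. The class name `Header`, the field names and the use of `uuid.uuid4()` as a stand-in for whichever ID scheme gets chosen are all assumptions for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Header:
    """Hypothetical message header: every message has its own ID."""
    message_id: uuid.UUID
    # ID of the message this one answers; None for an unsolicited message.
    reply_reference: Optional[uuid.UUID] = None


def make_reply(request):
    """Create the header of a reply: a fresh ID plus a pointer to the request."""
    return Header(message_id=uuid.uuid4(), reply_reference=request.message_id)


request = Header(message_id=uuid.uuid4())
reply = make_reply(request)
print(reply.reply_reference == request.message_id)
```

Keeping the reference in its own field leaves the message ID unique per message, so the two concerns stay independent.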

Same with the message format, I can open an issue with my few notes so far in the evening.

BenediktBurger commented 1 year ago

However, if a format is widely known/standardised/available via multiple implementations, that might trump saving a couple of bytes per message.

I agree. We should take the existing standards into consideration where possible. Availability of implementations might be a deciding factor for one standard or another.

I was thinking of a reply-reference field that

I like that name.

However, imo this is orthogonal to message identifiers and we should track that in a separate issue.

If we decide that message identifiers are unique per message, the two are orthogonal. If we used one message ID for a whole conversation (request and response), the reply reference would belong in this issue.

I think it is good to have (at least the possibility, not necessarily the obligation) a unique identifier for each message. Therefore the reply reference goes into another field and another issue.

BenediktBurger commented 1 year ago

Same with the message format, I can open an issue with my few notes so far in the evening.

Just for naming the issues: We have basically four parts:

Data protocol header (probably topic and content, maybe ID)
Data protocol content
Control protocol header
Control protocol content

bilderbuchi commented 1 year ago

I think it is good to have (at least the possibility, not necessarily the obligation) a unique identifier for each message. Therefore the reply reference goes into another field and another issue.

I think that unique ids for each message should be obligatory, as from that you can construct the sequence of messages after the fact. This won't be possible if we have one id per conversation/thread (e.g. if clocks are not perfectly synchronized you can't rely on the timestamps).
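As a small illustration of that point, assume each logged message carries its own ID plus a reference to the message it replies to. The record layout, the IDs `"a"`, `"b"`, `"c"` and the function name are made up, and the sketch assumes a simple chain where each message has at most one reply and there are no cycles.

```python
def conversation_order(log, root_id):
    """Follow reply references from the first message to recover the sequence."""
    reply_to = {
        m["reply_reference"]: m["id"]
        for m in log
        if m["reply_reference"] is not None
    }
    order = [root_id]
    while order[-1] in reply_to:
        order.append(reply_to[order[-1]])
    return order


log = [
    # The node that sent "b" has a clock running a few ms behind.
    {"id": "b", "reply_reference": "a", "timestamp_ms": 995},
    {"id": "a", "reply_reference": None, "timestamp_ms": 1000},
    {"id": "c", "reply_reference": "b", "timestamp_ms": 1010},
]

by_time = [m["id"] for m in sorted(log, key=lambda m: m["timestamp_ms"])]
by_reference = conversation_order(log, "a")
print(by_time, by_reference)
```

Sorting by timestamp alone puts the reply before its request here, while following the per-message IDs gives the actual order.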

bilderbuchi commented 1 year ago

So, you want a different message format for data and control messages, correct?

BenediktBurger commented 1 year ago

I want different formats for the different protocols, because they are like e-mail and TV.

The data protocol requires neither a recipient nor a way to answer. The content will be different, too. If we only allow data (in the sense of values, for example sensor values) and no commands etc., we can keep the data protocol very simple.

Everything else goes over the (more complicated) command protocol. You can request data via the command protocol as well, but that is only one use case.

bilderbuchi commented 1 year ago

OK, I just opened #20. Maybe the header can stay the same? Let's continue over there.

bilderbuchi commented 1 year ago

An overview/analysis of the "new" UUID formats: https://blog.devgenius.io/analyzing-new-unique-identifier-formats-uuidv6-uuidv7-and-uuidv8-d6cc5cd7391a IETF draft at https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-rfc4122bis

BenediktBurger commented 1 year ago

Thanks for the links. I'm for using UUIDv7

bilderbuchi commented 1 year ago

Another argument for UUID7: https://buildkite.com/blog/goodbye-integers-hello-uuids The context (database keys) is a bit different from ours, but we actually want what such discussions regard as a potential weakness (timestamps leaking from DB keys). The only drawback mentioned:

UUIDs are 128 bits long, twice as large compared to the 64 bit length of other alternative solutions. There is some additional storage overhead, but this is marginal when taking into account the storage of the rest of a database row, and the benefits of migration offset the overhead for our use case.

But still, considering this standard seems bound to be ratified, I think it's worth going with a known, standardized (IETF!) scheme that is likely to be(come) familiar to users, instead of some custom scheme that might be more efficient in some respect.

So, I agree: Let's choose UUIDv7.

pymeasure / leco-protocol

Message identifier #16