ssbc / ssb2-discussion-forum


Message ID issues #6

Open staltz opened 1 year ago

staltz commented 1 year ago

I know that you have defined the message ID as

> The message ID is the first 16 bytes of the author concatenated with the message hash. The message hash is the first 16 bytes of the blake3 hash of the concatenation of metadata bytes with signature bytes.
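For concreteness, here's a minimal sketch of that definition as I read it (assuming a blake3 function that returns a 32-byte Buffer, e.g. the `blake3` npm package; illustrative only, not normative):

```ts
import { hash } from 'blake3' // assumption: any blake3 implementation returning a Buffer

// msgHash = first 16 bytes of blake3(metadata ++ signature)
function msgHash(metadata: Buffer, signature: Buffer): Buffer {
  return hash(Buffer.concat([metadata, signature])).subarray(0, 16)
}

// msgId = first 16 bytes of author ++ msgHash  (32 bytes total)
function msgId(author: Buffer, metadata: Buffer, signature: Buffer): Buffer {
  return Buffer.concat([author.subarray(0, 16), msgHash(metadata, signature)])
}
```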

But why truncate the author ID? Given a message ID from a stranger, which I don't have in my database, I would like to have the option of replicating that author. But if I don't have their full author ID, I won't be able to ask other peers for their content.

That's just one problem. The other problem is about the "message shortcut": in SSB we have discussed that ideally it could be ssb:message/AUTHOR_ID/SEQUENCE instead of ssb:message/HASH. authorId+sequence is neat, but minibutt doesn't have a sequence, so what would we do instead?

Perhaps ssb:message/AUTHOR_ID/NESTED_FEED_TYPE/BLAKE3_HASH16

where

Thoughts?

@arj03 @gpicron

arj03 commented 1 year ago

I do like ssb:message/AUTHOR_ID/NESTED_FEED_TYPE/BLAKE3_HASH16, and yeah, maybe this shortening of the author was not the best idea, especially because we don't need it for the previous pointer, mainly for tangles. I still think it would be best to keep the hash at 16 bytes. There is the birthday paradox to take into account, which effectively halves the size. There are no known problems with blake3 atm (sha1 e.g. has some flaws, so even though it's 20 bytes, it's effectively much less).

gpicron commented 1 year ago

Setting aside the comments I made in a separate issue, which imply that I'm more in favor of specialising feeds and decoupling them from signing keys, which goes in the opposite direction:

I have no strong opinion on how a reference to a social media message should be done in a human-readable way.

I generally like the IPFS approach based on multiformats (https://github.com/multiformats/cid), which makes the conversion between binary and various human-readable representations simple and unambiguous.

Then the question is what securely identifies a message, a feed, a blob, and what additional information we want in those pointers.

For the Feed, I'm in favor of some hash of a bootstrap message instead of a pub key. For the Messages,

staltz commented 1 year ago

> I have no strong opinion on how a reference to a social media message should be done in a human-readable way.

@gpicron if you're talking about ssb:message/AUTHOR_ID/NESTED_FEED_TYPE/BLAKE3_HASH16 then we're not aiming for human readability with this, in fact I don't expect any SSB URI to be human readable. They might as well be binary encoded with BFE, which should mostly be a 1-to-1 mapping with URIs.

So URI string vs BFE is not the point. The point is what information we use to identify a message, and I think it has to include the author ID of that message: if you JIT replicate this message without knowing who authored it (assume that this JIT message is an emoji reaction), I would still like to know who created it, so I could visit their profile and so forth.

staltz commented 1 year ago

Another idea I have for this is to split the concept into:

* Message Hash: just the 16 first bytes of the blake3 hash of the concatenation of `metadata` bytes with `signature` bytes

* Message UID: `ssb:message/AUTHOR_ID/TYPE/MSG_HASH`

The msg hash would be used in fields such as previous where there is an array of these, and we don't need the information provided in ssb:message/AUTHOR_ID/TYPE because all of that is already implicit in the message itself.

However, the Message UID would be used when you refer to other authors' messages, e.g. in thread tangles, mentions, etc.
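To make the split concrete, a rough sketch (the function name and the base64url encoding of the hash are assumptions, nothing decided here):

```ts
// Message Hash: the 16-byte blake3 prefix defined above; lives inside `previous` arrays
type MsgHash = Buffer // 16 bytes

// Message UID: ssb:message/AUTHOR_ID/TYPE/MSG_HASH, for cross-feed references
// (thread tangles, mentions, etc.)
function messageUid(authorId: string, type: string, hash: MsgHash): string {
  return `ssb:message/${authorId}/${type}/${hash.toString('base64url')}`
}
```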

staltz commented 1 year ago

> I still think it would be best to keep the hash at 16 bytes. There is the birthday paradox to take into account, which effectively halves the size.

Oh, and about this problem: I took a look, and the Wikipedia Birthday problem page says

> The birthday problem in this more generic sense applies to hash functions: the expected number of N-bit hashes that can be generated before getting a collision is not 2^N, but rather only 2^(N/2).

So with 8 bytes, N in this case is 64, which means we can generate approximately 4 billion messages in a nested minibutt feed before we expect a collision. To me this seems sufficient for social network use cases. I can only think of an application where 4 billion messages are authored if you're doing something "big data", which is unlikely for us. And given sliced replication deletions, we're not going to hold 4 billion messages from a single author (and a single msg type) on disk simultaneously.

PS: 4 billion multiplied by 150 bytes (the tiniest message size you can think of) is roughly 650 GB.
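For reference, the arithmetic behind those figures (a quick sketch, using 2^(N/2) as the expected collision bound):

```ts
const hashBytes = 8
const bits = hashBytes * 8                        // N = 64
const collisionBound = 2 ** (bits / 2)            // 2^32 ≈ 4.29 billion messages
const minMsgSize = 150                            // bytes, the "tiniest message" above
console.log(collisionBound)                       // 4294967296
console.log((collisionBound * minMsgSize) / 1e9)  // ≈ 644 GB, i.e. the ~650 GB figure
```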

arj03 commented 1 year ago

> Another idea I have for this is to split the concept into:
>
> * Message Hash: just the 16 first bytes of the blake3 hash of the concatenation of `metadata` bytes with `signature` bytes
>
> * Message UID: `ssb:message/AUTHOR_ID/TYPE/MSG_HASH`
>
> The msg hash would be used in fields such as previous where there is an array of these, and we don't need the information provided in ssb:message/AUTHOR_ID/TYPE because all of that is already implicit in the message itself.
>
> However, the Message UID would be used when you refer to other authors' messages, e.g. in thread tangles, mentions, etc.

Agree. The separation is already there; I'll update the message UID example.

gpicron commented 1 year ago

What do you think of the proposal in https://gist.github.com/staltz/5b58934fc6a013df5efa2c40f155388b#gistcomment-4479818 (second part)? In summary:

1. In the metadata part, generalize the concept of a backlink. It becomes an array of pairs (relation, message id). The "previous" field is transformed into a backlink ("feed", id), and you can add useful application backlinks: ("thread", id of the root message of a thread), ("response", id of the post to which it responds), ("status", id of the previous status message). See the sketch after this list.
2. To generate a "unique" id, whatever the hash algo (actually, if we choose to use a multibase hash, we don't have to choose and we are future-proof), hash the metadata part, which should then contain the hash of the content. Doing so, you can exchange the metadata part with other peers in various algorithms and partial replication scenarios without breaking chain validations, while limiting the bandwidth. Additionally, dropping the content because of updates and deletes does not break the chain, because the crypto chain is on the metadata part.
3. In the idea of "all is a DAG", why do we need a single way and URI to refer to a given message? To be able to fetch it easily, what information do I need to provide to another peer? a. the root of a DAG it is part of, b. the relation ("feed", "thread", etc.) of the backlinks forming that DAG, c. some ID with limited collision within that DAG. At worst, if there is a collision on the last part, the peer can respond with both messages. These are pointers, not unique identifiers of the content. Like in git: the underlying unique ID of an item is the full hash, but you can refer to it in commands with only the first bytes; if that's enough to identify the item in the context (current branch, project), the git command will not complain.
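A rough sketch of what point 1 could look like in the metadata (all field names hypothetical, just to illustrate the shape):

```ts
// Generalized backlinks: an array of (relation, message id) pairs replaces
// the single `previous` field.
type Relation = 'feed' | 'thread' | 'response' | 'status' // open for extension

interface Backlink {
  relation: Relation
  id: string // message id, e.g. a multiformats-style hash string
}

interface Metadata {
  author: string
  contentHash: string   // point 2: the message id hashes this metadata,
                        // which in turn contains the hash of the content
  backlinks: Backlink[] // the classic `previous` becomes { relation: 'feed', id: ... }
}
```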
arj03 commented 1 year ago

Yes, I'm very open to that idea. It's the "all backlinks are a tangle" idea. Just trying to take it in steps.

arj03 commented 1 year ago

There is one interesting case when you replicate content based on different contexts. Let's say you replicate a thread that includes a message from person A. That message could be correct in the thread's tangle, but it might not be correct from the author tangle perspective. And you might never know that, but someone replicating that author from you might see it. So what could these errors be? Either the timestamp was not increasing, or the previous pointer (author/feed tangle) points to a message that is further ahead in the chain (loops). I'll note that this is not that different from any other kind of out-of-order replication. In classic SSB we also have this problem if you replicate a message ooo.

gpicron commented 1 year ago

That's why I propose to somehow grade the validations and trust, and to let the app decide/display what is acceptable and what grade of validation it has (a bit like in atacama). There are validations on the message alone (the signature, if you have the public key of the author and trust them), on the hash of the content, and then on the chains. The chain validations per "relation" should somehow score "how far back I have checked it". Similarly, this is what is done in some blockchains for lite nodes not having the full history: after several blocks they consider it valid and rely on devices having the full history to detect long branches. And the actions allowed by a node with partial history are limited as risk mitigation for the applicative context.

Point 2: advancing stepwise does not mean that you cannot make the feed format open for extension. If you say now that there is a field "previous", that makes it more complex to later introduce a generic field "backlink" as an array of pairs (id, relation). Doing it now, and saying that for now we only define the behaviour relative to the relation "feed-previous", is the same amount of work, is still stepwise, and leaves it open for easier evolution. The same applies to id formats: saying that for ids, keys, hashes and crypto we embrace multiformats does not close the door on evolution, even if you specify that the only algos currently supported are ed25519 for signing and blake3-128 for hashing.