Open staltz opened 1 year ago
I do like `ssb:message/AUTHOR_ID/NESTED_FEED_TYPE/BLAKE3_HASH16` and yeah, maybe this shortening of the author was not the best idea, especially because we don't need this for the previous pointer, mainly for tangles. I still think it would be best to keep the hash to 16 bytes. There is the birthday paradox to take into account that halves the size. There are no known problems with blake3 atm (sha1, for example, has some flaws, so even though it's 20 bytes, it's still effectively much less).
Apart from the comments I made in a separate issue, which imply that I'm more in favor of specialising feeds and decoupling them from signing keys (which goes in the opposite direction), I have no strong opinion on how a reference to a social media message should be done in human readable way.
I generally like the approach of IPFS (https://github.com/multiformats/cid), based on multiformats, which makes conversion between binary and various human-readable contexts simple and unambiguous.
Then the question is what securely identifies a message, a feed, or a blob, and what additional information we want in those pointers.
For the Feed, I'm in favor of some hash of a bootstrap message instead of a pub key. For the Messages,
> I have no strong opinion on how a reference to a social media message should be done in human readable way.
@gpicron if you're talking about `ssb:message/AUTHOR_ID/NESTED_FEED_TYPE/BLAKE3_HASH16` then we're not aiming for human readability with this; in fact I don't expect any SSB URI to be human readable. They might as well be binary encoded with BFE, which should mostly be a 1-to-1 mapping with URIs.
So URI string or BFE is not the point. The point is what information we are using to identify a message, and I think it has to include the author ID of that message. If you JIT replicate a message without already knowing its author (assume that this JIT message is an emoji reaction), then I would like to know who created the message, so I could visit their profile and so forth.
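To make the point about the author ID concrete, here is a minimal sketch of splitting such a message URI into its parts, so that a client receiving a lone JIT-replicated message can still find its author. The names `MessageUid` and `parse_message_uid` are hypothetical, and the sketch assumes that each segment uses a URL-safe encoding (no `/` inside a segment), which the thread does not specify.

```python
from typing import NamedTuple


class MessageUid(NamedTuple):
    author_id: str  # full author ID, so the author can be looked up or replicated
    feed_type: str  # nested feed type, e.g. "post"
    msg_hash: str   # truncated message hash, left in its textual encoding


def parse_message_uid(uri: str) -> MessageUid:
    """Split a URI shaped like ssb:message/AUTHOR_ID/TYPE/MSG_HASH.

    Assumption: no segment contains "/", i.e. IDs and hashes are URL-safe.
    """
    prefix = "ssb:message/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a message URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected AUTHOR_ID/TYPE/MSG_HASH, got: {uri!r}")
    return MessageUid(*parts)


# Placeholder segments only; real values would be encoded keys and hashes.
uid = parse_message_uid("ssb:message/AUTHOR_ID/post/MSG_HASH")
print(uid.author_id)
```

With a bare `ssb:message/HASH` there would be nothing equivalent to `uid.author_id` to follow.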
Another idea I have for this is to split the concept into:
* Message Hash: just the 16 first bytes of the blake3 hash of the concatenation of `metadata` bytes with `signature` bytes
* Message UID: `ssb:message/AUTHOR_ID/TYPE/MSG_HASH`

The msg hash would be used in fields such as `previous` where there is an array of these, and we don't need the information provided in `ssb:message/AUTHOR_ID/TYPE` because all of that is already implicit in the message.
However, the Message UID would be used when you refer to other author's messages, e.g. in thread tangles, mentions, etc.
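The split between Message Hash and Message UID could be sketched roughly as follows. Two loud assumptions: BLAKE3 is not in the Python standard library, so `hashlib.blake2b` is used here purely as a stand-in (a real implementation would call a BLAKE3 library instead), and the unpadded URL-safe base64 encoding of the hash inside the URI is a guess, since the thread does not fix an encoding. The function names are hypothetical.

```python
import base64
import hashlib

HASH_LEN = 16  # bytes kept from the digest, as proposed in the thread


def message_hash(metadata: bytes, signature: bytes) -> bytes:
    """First 16 bytes of the hash of metadata bytes followed by signature bytes.

    Stand-in: blake2b replaces BLAKE3 only so the sketch runs with the stdlib.
    """
    digest = hashlib.blake2b(metadata + signature).digest()
    return digest[:HASH_LEN]


def message_uid(author_id: str, msg_type: str, msg_hash: bytes) -> str:
    """Build ssb:message/AUTHOR_ID/TYPE/MSG_HASH (hash encoding is assumed)."""
    encoded = base64.urlsafe_b64encode(msg_hash).decode("ascii").rstrip("=")
    return f"ssb:message/{author_id}/{msg_type}/{encoded}"


# Inside a feed, `previous` only needs the bare hash ...
h = message_hash(b"example metadata", b"example signature")
# ... while references to other authors' messages carry the full UID.
print(message_uid("AUTHOR_ID", "post", h))
```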
> I still think it would be best to keep the hash to 16 bytes. There is the birthday paradox to take into account that halves the size.
Oh, and about this problem: I took a look, and the Wikipedia page on the birthday paradox says:
> The birthday problem in this more generic sense applies to hash functions: the expected number of N-bit hashes that can be generated before getting a collision is not 2^N, but rather only 2^(N/2).
So with 8 bytes, N in this case is 64, which means we can generate approximately 2^32, about 4 billion, messages in a nested minibutt feed before we expect a collision. To me this seems sufficient for social network use cases. The only application I can think of where 4 billion messages are authored is something "big data", which is unlikely for us. And given sliced replication deletions, we're not going to hold 4 billion messages from a single author (and a single msg type) on disk simultaneously.
PS: 2^32 messages multiplied by 150 bytes (the tiniest message size you can think of) is roughly 640 GB.
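The arithmetic can be checked with a few lines. This is only a sketch of the 2^(N/2) rule of thumb quoted from Wikipedia, not an exact collision probability; the function name and the 150-byte message size constant are taken from, or named after, the discussion above.

```python
def messages_before_expected_collision(hash_bytes: int) -> int:
    """Rule-of-thumb birthday bound: about 2^(N/2) hashes for an N-bit hash."""
    n_bits = 8 * hash_bytes
    return 2 ** (n_bits // 2)


MIN_MESSAGE_BYTES = 150  # "tiniest message size" assumed in the thread

for hash_bytes in (8, 16):
    count = messages_before_expected_collision(hash_bytes)
    storage_gb = count * MIN_MESSAGE_BYTES / 1e9
    print(f"{hash_bytes}-byte hash: ~{count:,} messages, ~{storage_gb:,.0f} GB")
```

For an 8-byte hash this gives 2^32 = 4,294,967,296 messages and about 644 GB of minimum-size messages; for a 16-byte hash the bound is 2^64.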
> Another idea I have for this is to split the concept into:
>
> * Message Hash: just the 16 first bytes of the blake3 hash of the concatenation of `metadata` bytes with `signature` bytes
> * Message UID: `ssb:message/AUTHOR_ID/TYPE/MSG_HASH`
>
> The msg hash would be used in fields such as `previous` where there is an array of these, and we don't need the information provided in `ssb:message/AUTHOR_ID/TYPE` because all of that is already implicit in the message.
>
> However, the Message UID would be used when you refer to other author's messages, e.g. in thread tangles, mentions, etc.
Agree. The separation is already there, I'll update the message UID example.
What do you think of the proposal in https://gist.github.com/staltz/5b58934fc6a013df5efa2c40f155388b#gistcomment-4479818 (second part). In summary:
Yes, I'm very open to that idea. It's the "all backlinks are a tangle" idea. Just trying to take it in steps.
There is one interesting case when you replicate content based on different contexts. Let's say you replicate a thread that includes a message from person A. That message could be correct in the thread's tangle, but it might not be correct from the author tangle perspective. And you might never know that, but someone replicating that author from you might see it. So what could these errors be? It could be either that the timestamp was not increasing, or that the previous (author, feed tangle) points to a message that is further along in the chain (loops). I'll note that this is not that different from any other kind of out-of-order replication. In classic we also have this problem if you replicated a message out of order (ooo).
That's why I propose to somehow grade the validations and trust, and to let the app decide/display what is acceptable and the grade of validation (a bit like in atacama). There are validations on the message alone (the signature, if you have the public key of the author and trust them), the hash of the content, and then the chains. The chain validation per "relation" should somehow score "how far back I have checked it". This is similar to what is done in some blockchains for lite nodes that don't have the full history: after several blocks they consider a block valid and rely on devices having the full history to detect long branches. And the actions allowed for a node with partial history are limited, as risk mitigation for the application context.
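One way to picture such graded validation is the small sketch below. Everything in it is hypothetical: the `ValidationGrade` fields, the per-relation depth counter, and the `is_acceptable` policy are illustrations of the idea, not an existing SSB or atacama API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ValidationGrade:
    signature_ok: bool     # signature checked against a known, trusted author key
    content_hash_ok: bool  # content matches the hash stored in the metadata
    # For each relation (e.g. "feed-previous"), how many links back were verified.
    chain_depth: Dict[str, int] = field(default_factory=dict)


def is_acceptable(grade: ValidationGrade, relation: str, min_depth: int) -> bool:
    """App-level policy: message-level checks plus a minimum verified chain depth."""
    return (
        grade.signature_ok
        and grade.content_hash_ok
        and grade.chain_depth.get(relation, 0) >= min_depth
    )


# A message fetched out of order: valid alone, but its feed chain is unchecked.
grade = ValidationGrade(signature_ok=True, content_hash_ok=True)
print(is_acceptable(grade, "feed-previous", min_depth=0))
print(is_acceptable(grade, "feed-previous", min_depth=3))
```

An app could display the message with `min_depth=0` but require a deeper check before allowing riskier actions.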
Point 2, advancing stepwise, does not mean that you cannot make the feed format open for extension. If you say now that there is a field `previous`, that makes it more complex to later introduce a generic field "backlink" as an array of (id, relation) pairs. Doing it now, and stating that for now we only define the behaviour for the relation "feed-previous", is the same amount of work and is still stepwise, but leaves the format open for easier evolution. The same applies to ID formats. Saying that for IDs, keys, hashes and crypto we embrace multiformats does not close the door to evolution, even if you specify that the only algorithms currently supported are ed25519 for signing and blake3-128 for hashing.
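As an illustration of the generic backlink field, here is a minimal sketch in which `previous` is just the subset of backlinks whose relation is "feed-previous". The `Backlink` and `Message` classes and the "thread" relation name are hypothetical; only the (id, relation) pair and the "feed-previous" relation come from the comment above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Backlink:
    id: str        # ID of the message being pointed at
    relation: str  # meaning of the link, e.g. "feed-previous"


@dataclass
class Message:
    author: str
    content: bytes
    backlinks: List[Backlink] = field(default_factory=list)

    def links(self, relation: str) -> List[str]:
        """IDs of all backlinks with the given relation."""
        return [b.id for b in self.backlinks if b.relation == relation]


msg = Message(
    author="AUTHOR_ID",
    content=b"hello",
    backlinks=[
        Backlink("MSG_HASH_1", "feed-previous"),
        Backlink("MSG_HASH_2", "thread"),  # hypothetical future relation
    ],
)
# Today only "feed-previous" needs defined behaviour; other relations can come later.
print(msg.links("feed-previous"))
```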
I know that you have defined the message ID as
But why truncate the author ID? Given a message ID from a stranger, which I don't have in my database, I would like to have the option of replicating that author. But if I don't have their full author ID, I won't be able to ask other peers for their content.
That's just one problem. The other problem is about the "message shortcut", which ideally in SSB we have spoken that it could be `ssb:message/AUTHOR_ID/SEQUENCE` instead of `ssb:message/HASH`. authorId+sequence is neat, but in the case of minibutt, which doesn't have `sequence`, what would we do instead?

Perhaps `ssb:message/AUTHOR_ID/NESTED_FEED_TYPE/BLAKE3_HASH16` where
* `AUTHOR_ID` e.g. `QlCTpv...`
* `NESTED_FEED_TYPE` e.g. `post`
* `BLAKE3_HASH16` would be the first 16 bytes of the "concatenation of metadata bytes with signature bytes", or perhaps something even shorter, like the first 8 bytes (8 bytes seems enough because it can represent 18 billion billion things, and a single author's collection of `post`s or whatever is RARELY going to get even close to that).

Thoughts?
@arj03 @gpicron