Previous validation - Githubissues

staltz commented 1 year ago

I said in #8:

About (2) failing due to msg 2 not being in disk anymore, we could use a rule of thumb to just ignore 3'' deeming it too old. What do you think?

Actually, I'm not sure about this. Maybe the new message is merging many "heads", some old and some new, so the merge may be valuable even if one of the heads is very old. And anyway, with sliced replication it is always the case that one or more of the "oldest" messages will have their previous missing from disk.

I think we should rethink what the validation logic for "previous" should mean in minibutt.

Previous validation in ssb-classic does just ONE thing: it takes the previous message (from db2 base or state), calculates its msg ID, and matches with newMsg.value.previous. Nothing else.

So in essence it is just checking whether the new message points to the latest head as its previous. In other words, classic validation wants to rule out forking off of an older message in the feed.
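
A minimal sketch of that single check, under stated assumptions: the `MsgValue` shape is only loosely modeled on ssb-classic, and `computeMsgId` is a hypothetical helper standing in for whatever hashing the real implementation uses.

```typescript
// Hypothetical message shape, loosely modeled on ssb-classic (not a library type).
interface MsgValue {
  previous: string | null;
  author: string;
  sequence: number;
  content: unknown;
}

// Assumed helper: turns a message value into its message ID.
type ComputeMsgId = (value: MsgValue) => string;

function hasValidClassicPrevious(
  latest: MsgValue | null,
  next: MsgValue,
  computeMsgId: ComputeMsgId
): boolean {
  // First message in a feed: there is nothing to point at.
  if (latest === null) return next.previous === null;
  // Otherwise the new message must point at the current head, and nothing else.
  return next.previous === computeMsgId(latest);
}
```

Under this rule, a message whose `previous` points at anything other than the current head is rejected, which is exactly the forking behaviour that classic validation rules out.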

But forking off of an older message in the feed is a feature in minibutt!

Which leads us to the question: what would constitute an invalid previous field in minibutt? I can think of:

- The previous ID is gibberish, doesn't refer to any actual message
- The previous ID refers to a message belonging to ANOTHER feed
- The previous ID is in the wrong type (e.g. it's an integer)

All these are rather unlikely to happen unless there is a buggy implementation. I don't know why a peer would lie about their own feed structure, bugging it on purpose. Perhaps to prevent other peers from crashing on invalid types, we should do basic validation that the previous ID "seems" correct.

But due to sliced replication and deletions, we can't actually fetch the previous msg from disk and do a real check.
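
A shallow check along those lines could look like the sketch below. It assumes `previous` is an array of ID strings and that IDs follow the classic SSB shape (`%` + base64 SHA-256 + `.sha256`); the ID format minibutt actually uses may differ, so `MSG_ID_PATTERN` is an assumption for illustration only.

```typescript
// Assumed ID shape: classic SSB message IDs. 32 hash bytes encode to
// 43 base64 characters plus one "=" of padding.
const MSG_ID_PATTERN = /^%[A-Za-z0-9+/]{43}=\.sha256$/;

// Returns true when `previous` merely "seems" correct: an array whose
// entries are all well-formed ID strings. An empty array passes, on the
// assumption that the first message of a feed has no previous.
function looksLikeValidPrevious(previous: unknown): previous is string[] {
  if (!Array.isArray(previous)) return false;
  return previous.every(
    (id) => typeof id === "string" && MSG_ID_PATTERN.test(id)
  );
}
```

This catches wrong types and obviously malformed IDs. It cannot tell whether a well-formed ID refers to a real message or to a message on another feed, since that would require the referenced message to be on disk.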

arj03 commented 1 year ago

The way I see it is there are 4 options:

1. no previous and no other mechanism for tying things together other than some application-specific tangles
2. no previous, but seq + application-specific tangles. This is what was in the first version before seq became a previous array. The reason to have seq is that you can detect if you are missing something when doing author + feed based replication. Also the seq you keep locally is just the latest, so this model is easy to adapt to the current stack.
3. previous array, like we currently have. This gives more guarantees than having seq, but is also more complicated
4. tangle-like backlinks (previous if you want). This is the most general model but also the most complex

staltz commented 1 year ago

Can you explain better the second option, and how it would function in the presence of forks?

arj03 commented 1 year ago

Sure. The fundamental requirement is that timestamps must always be increasing. A fork is any case where you receive a message whose sequence is the same as, or less than, that of your current head. Here I think of forks in the general case, so it could be either recovery or another device that did not have the latest message before it posted a message. I don't think forks are really that big of a problem. Either your application's conflict model is last writer wins, which is easy because of the timestamps, or it is more complex, in which case you must rely on the application-specific tangles anyway. Tangles are even more general than multiple previous because they can handle multiple authors.
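
As a rough sketch of this model: the `SeqMsg` shape and both function names are hypothetical, and the tie-breaking rule for equal timestamps is an arbitrary choice that the discussion does not specify.

```typescript
// Hypothetical message shape for the "seq + timestamp" model.
interface SeqMsg {
  author: string;
  seq: number;
  timestamp: number;
}

// A fork is detected when an incoming message has a sequence that is
// the same as, or less than, the sequence of the current head.
function isFork(head: SeqMsg, incoming: SeqMsg): boolean {
  return incoming.seq <= head.seq;
}

// Last writer wins: the message with the larger timestamp is kept.
// On a tie this sketch keeps `a`; a real rule would need to be deterministic
// across peers.
function lastWriterWins(a: SeqMsg, b: SeqMsg): SeqMsg {
  return b.timestamp > a.timestamp ? b : a;
}
```

Anything beyond last writer wins would be resolved at the application level through tangles rather than in these two functions.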

gpicron commented 1 year ago

You know more or less my point of view. I think the generic "all is tangle" approach is better. We need to find an efficient way to replicate tangles anyway. So what appears more complex initially may turn out easier in the end, because we end up with a single replication logic optimized for tangles rather than two, one for feeds and one for tangles.

staltz commented 1 year ago

@arj03 I see what you mean, thanks. Forks aren't hard to deal with IF we accept that we will lose some content. E.g. if the losing fork has a long post that I took 30min to write, I might get quite sad if the winning fork "erased" it. So I think some kind of CRDT system that creates merges and preserves content in the forks is better. What you describe is kind of like (pardon me for this comparison) blockchains, because forks can always occur there but the forks are ignored/pruned and the branch with the most "proof of work" wins.

There's still the alternative that if we go with your idea, then we could "copy-paste" the losing fork's content onto new messages on the winning fork. Kind of like a git cherry-pick.

That said, I do like the idea of single replication logic for tangles, as @gpicron said.

staltz commented 1 year ago

"last writer wins and this is easy because of the timestamps"

I'm thinking about your suggestion @arj03 and it seems there is a way it could break in EBT. That's because multiple peers may have different forks of the feed, and if you compare fork A and fork B, A might win over B, but if you compare fork A and fork C, C might win. So one of the peers chooses to continue appending on C while another peer continues to append on A, and this fork isn't resolved.

I think the problem is that at no point does a peer ever get two forks locally in order to compare them. What happens instead is that if you announce that you have everything up until sequence 10, nobody will send you an alternative sequence 10, they will just send you 11 and so forth. So you end up never downloading the fork, and thus you can't compare: you don't know if the fork would be a winner.
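
The effect can be illustrated with a deliberately simplified model of sequence-based replication. This is not the actual EBT protocol; `FeedMsg` and `msgsToSend` are hypothetical names used only for this sketch.

```typescript
// Hypothetical minimal message: a sequence number and an ID.
interface FeedMsg {
  seq: number;
  id: string;
}

// Simplified model: the remote peer only sends messages whose sequence
// is greater than the sequence the local peer announced it already has.
function msgsToSend(remoteLog: FeedMsg[], announcedSeq: number): FeedMsg[] {
  return remoteLog.filter((m) => m.seq > announcedSeq);
}

// The remote peer holds a different fork starting at sequence 10.
const remoteLog: FeedMsg[] = [
  { seq: 9, id: "m9" },
  { seq: 10, id: "m10-fork" },
  { seq: 11, id: "m11-fork" },
];

// The local peer announces it has everything up until sequence 10.
const received = msgsToSend(remoteLog, 10);
```

In this model `received` contains only `m11-fork`. The alternative message at sequence 10 is never transmitted, so the local peer has nothing to compare its own sequence 10 against.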

Unless we implement something other than EBT replication...

staltz commented 1 year ago

Another comment: for SSB2 I think we will need tangle sync anyway, because we're not going to do hops 2+ replication, we're just going to have:

- Sliced replication of nested feeds in my followlist (hops 1)
- Tangle sync of the threads referenced by my followlist

I'm not saying that we should drop your sequence number (and previousless msgs) idea, I think we should consider all the options. But it seems clear that we have to invest in tangle sync, how to implement it etc.

arj03 commented 1 year ago

Yep, for me EBT compatibility was not a goal, because it is complex. For what it is supposed to deal with, replicating feeds from start to finish, it works really well, but it's very hard to do anything else with it. And yes, agreed on tangle sync.

arj03 commented 1 year ago

I was close to arguing for the tangle backlinks solution until I considered encrypted messages. With this in mind I'm leaning towards the current solution with previous array.

gpicron commented 1 year ago

Can you explain the problem with encrypted messages?

arj03 commented 1 year ago

If you put membership tangle information or just thread tangle info in the outer layer then you leak information. And if you start encrypting parts of the tangle info then it also gets more complicated. So I think the cleanest solution is to just have previous related to author/feed.

gpicron commented 1 year ago

@arj03 in the bundle structure proposed in https://github.com/ssbc/minibutt-spec/issues/10#issuecomment-1455207795 there is no problem with having one Chaining Block in clear text and one encrypted.

gpicron commented 1 year ago

"If you put membership tangle information or just thread tangle info in the outer layer then you leak information. And if you start encrypting parts of the tangle info then it also gets more complicated. So I think the cleanest solution is to just have previous related to author/feed."

I think the subject of private messages and groups needs its own thread. I'll start a new issue.

ssbc / ssb2-discussion-forum

Previous validation #11