ssbc / ssb2-discussion-forum

not quite tiny, also not quite large

PPPPP replication race conditions #28

Closed staltz closed 7 months ago

staltz commented 7 months ago

Context

I'm implementing the "conductor" in PPPPP, which is sort of like ssb-replication-scheduler, and I bumped into a wider issue.

When replicating, I get the error 'Unknown account tangle owning this msg'. It happens when you replicate (say) a post msg: validation checks that msg.pubkey is one of the keys declared in the account tangle identified by msg.metadata.account. In other words, in order to accept this msg, we first need to have replicated the account tangle. If the local database doesn't know anything about msg.metadata.account, it throws that error.
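To make the dependency concrete, here's a minimal sketch of that validation step. All the shapes here (the `db` Map, the `pubkeys` field) are hypothetical stand-ins, not the actual PPPPP API:

```javascript
// Sketch: a feed msg can only be accepted once its owning account
// tangle is already in the local database (shapes are illustrative)
function validateMsgAccount(msg, db) {
  const account = db.get(msg.metadata.account)
  if (!account) {
    throw new Error('Unknown account tangle owning this msg')
  }
  // the signing key must have been added in the account tangle
  if (!account.pubkeys.has(msg.pubkey)) {
    throw new Error('pubkey does not belong to this account')
  }
  return true
}

// Stand-in database: one known account owning one pubkey
const db = new Map([['accountA', { pubkeys: new Set(['key1']) }]])

console.log(
  validateMsgAccount({ pubkey: 'key1', metadata: { account: 'accountA' } }, db)
) // true
```

Any msg whose msg.metadata.account misses in the database hits the first branch, which is exactly the error above.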

Problem

So this imposes an order on which tangles to replicate: you must always replicate the account tangle before replicating any feed tangle owned by that account. This is not too difficult to implement; however, there are corner cases:

Suppose you already have Bob's account tangle replicated, and you start replicating Bob's post feed. In one of the msgs in Bob's post feed, you stumble upon msg.pubkey = a911a911a911. The problem now is to figure out whether a911a911a911 is NOT one of Bob's pubkeys (in which case the msg should be rejected) or whether we just haven't caught up on the latest state of Bob's account tangle. From Alice's perspective, she can't quite distinguish these two cases, so she has to wait "long enough" to catch up on Bob's account tangle.

The scenario above is not that difficult either. Maybe it's just a matter of always replicating (all) account tangles before trying to replicate any other kind of tangle. After all, if the remote peer is giving you a msg with a certain msg.pubkey, then the remote peer must also have validated msg.pubkey against msg.metadata.account, which means they should have the latest account tangle locally. However, here's another tricky corner case:

When replicating weave tangles such as discussion threads, you don't know the thread participants (their accounts) in advance. So when you replicate a discussion thread, you may get a msg with a so-far-unknown msg.pubkey and msg.metadata.account. Then, what do you do? Do you reject this msg? Do you "pause" replication so you can take a break to replicate the account tangle first? (How would "continuation" work in practice, in implementations?)

Vague ideas

Part of me wonders whether there is a better solution we could come up with. Like maybe this could be part of the tangle sync algorithm. Currently it works (roughly) by:

What if it worked like this?:

This could be quite wasteful though. Consider a somewhat big account, e.g. one that has added/removed ~20 devices. And then consider the case of replicating a discussion thread, where you'll have to get N account tangles for the N thread participants. Not to mention that this is wasteful because you may already have all those account tangles in their latest state.

staltz commented 7 months ago

Ok, I have one potential solution: when putting a certain msg ID in the bloom filter, also include msg IDs of the account tangle associated with the msg.metadata.account at the state msg.metadata.accountTips

mixmix commented 7 months ago

Another solution: separate the sorts of validation

Accept all messages with signatures, but record ones which have not been fully validated against an account.

You can then do follow-up queries to try to resolve that missing data.
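A minimal sketch of how that two-phase approach could look (all names and shapes here are illustrative, not a proposed API): msgs whose account is already known are validated immediately, the rest are queued, and a queued msg is released for full validation once its account tangle arrives.

```javascript
// Semi-validated pool sketch: queue msgs whose account tangle we
// haven't replicated yet, resolve them when the account arrives
const knownAccounts = new Set(['accountA'])
const pending = new Map() // accountId -> msgs waiting on that account

function receiveMsg(msg) {
  if (knownAccounts.has(msg.metadata.account)) return 'validated'
  const queue = pending.get(msg.metadata.account) || []
  queue.push(msg)
  pending.set(msg.metadata.account, queue)
  return 'pending'
}

function receiveAccountTangle(accountId) {
  knownAccounts.add(accountId)
  const resolved = pending.get(accountId) || []
  pending.delete(accountId)
  return resolved // these can now be fully validated
}

console.log(receiveMsg({ id: 'm1', metadata: { account: 'accountB' } })) // 'pending'
console.log(receiveAccountTangle('accountB').map((m) => m.id)) // ['m1']
console.log(receiveMsg({ id: 'm2', metadata: { account: 'accountB' } })) // 'validated'
```

The follow-up queries mentioned above would be what triggers `receiveAccountTangle` here.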

mixmix commented 7 months ago

This happens in life - you hear someone has said something, and you might be like "wait who is that" and I can reply "oh that's my sister!"

staltz commented 7 months ago

> Another solution: separate the sorts of validation

I briefly thought about this too, but we need to think in more detail about what happens in the corner cases:

  • We need to record a boolean for validated or not. Bitvector index?
  • What if the msg received turned out to be a lie?
  • What if we never manage to replicate the account tangle?
  • Do we add expiration timeouts for semi-validated received msgs? How much timeout?
  • Can we already read and use the msg before it's fully validated? What happens to app logic downstream once we discover the msg was actually invalid?

ahdinosaur commented 7 months ago

hmm! :smiley_cat:

> Accept all messages with signatures, but record ones which have not been fully validated against an account.

my initial reaction is that this approach makes the most sense. seems sensible for high-latency eventual consistency.

> ... but we need to think in more detail about what happens in the corner cases:

> • We need to record a boolean for validated or not. Bitvector index?

with Scuttlebutt, you could check whether a message was valid when you received the message. this issue is telling us we can't assume the same for PPPPP. so then i think we should treat validation as multi-faceted (not a single boolean, but many booleans). when you receive a message you can check the signature, but you might need to wait to verify other facets. at the moment we have two facets: "is this message authentic to the referenced ed25519 verifying key?" and "is this message authentic to the referenced account?", but who knows maybe we'll discover more.
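The "many booleans" idea could be sketched as bit flags, one per validation facet (the facet names below are illustrative, not from any spec):

```javascript
// Multi-faceted validation state as a bitfield: a msg is fully valid
// only when every facet has been independently verified
const FACET = {
  SIGNATURE: 1 << 0, // msg is authentic to its ed25519 verifying key
  ACCOUNT: 1 << 1,   // pubkey is authentic to the referenced account
}
const FULLY_VALID = FACET.SIGNATURE | FACET.ACCOUNT

function isFullyValid(facets) {
  return (facets & FULLY_VALID) === FULLY_VALID
}

let facets = FACET.SIGNATURE      // checkable as soon as the msg arrives
console.log(isFullyValid(facets)) // false: account facet still pending
facets |= FACET.ACCOUNT           // account tangle replicated later
console.log(isFullyValid(facets)) // true
```

A bitvector index over such flags would let the database cheaply query "all msgs still waiting on the account facet".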

> • What if the msg received turned out to be a lie?

i think we shouldn't accept a message as truth if it could later become a lie.

> • What if we never manage to replicate the account tangle?

then it'll stay as a to-be-verified message. could be garbage collected since presumably no goals would want to keep it around.

> • Do we add expiration timeouts for semi-validated received msgs? How much timeout?

nah, i don't think we need a timeout. again, garbage collection would be the timeout.

> • Can we already read and use the msg before it's fully validated? What happens to app logic downstream once we discover the msg was actually invalid?

in the case of posts, i think applications can show these messages to the user, but with an asterisk that says "not yet verified to be Alice". however in the case of more complex states, i don't think the message should be used as an update.

in which case, maybe we shouldn't show or use the message at all until it's considered fully-valid. that'd fit better with "we shouldn't accept a message as truth if it could later become a lie."

staltz commented 7 months ago

Nice to wake up to eager comments, thanks!

> Ok, I have one potential solution: when putting a certain msg ID in the bloom filter, also include msg IDs of the account tangle associated with the msg.metadata.account at the state msg.metadata.accountTips

Did anyone give thought to this suggestion of mine? I've been thinking about it, and it seems straightforward. If we can avoid a semi-validated pool of msgs, I think that would be best. Having to deal with semi-validated msgs in the database (as well as in application code) sounds like a nightmare to me. It's not a desirable state, and it doesn't seem to gain us any features.

The gist is simple: if Bob has a (feed) msg, we presume he should also have the account tangle msg that validates the feed msg. So when Bob transfers the feed msg, he can also transfer the account tangle msg that validates it. This sounds wasteful/verbose, but with bloom filters we can skip sending the account tangle msg if Alice already has it!

Tangle sync recap

As a recap of how tangle sync works, Alice and Bob tell each other what slices of the tangle they have (my haveRange) and based on what the other one has, they can determine what slices they want to receive from the other one (my wantRange).

```mermaid
sequenceDiagram
    participant A as Alice
    participant B as Bob
    Note over A: I want to sync tangle<br/>with ID "T" and goal aliceG
    Note over B: I want to sync tangle<br/>with ID "T" and goal bobG
    Note over A: aliceHave := This is what I currently have
    A->>B: Phase 1: Send T and aliceHave
    Note over B: bobHave := This is what I currently have
    Note over B: bobWant := I am interested in some stuff Alice has
    B->>A: Phase 2: Send T, bobHave and bobWant
    Note over A: aliceWant := I am interested in some stuff Bob has
```

Once they know what each other wants and has, they can start calculating precisely which msgs the other side is missing from their wantRange. But for that, we need to know exactly which msgs the other side already has. We don't know that information; we only know their haveRange (which is [minDepth, maxDepth]). Alice could just send a vector of all the msg IDs she has, and this would totally work, but it would be wasteful in the case where the tangle has 2000 msgs and you just want to fetch 3 new msgs. The vector transmitted would have 2000 msg IDs!

This is where bloom filters come in. You don't need to understand them in detail, just need to know that they allow Bob to know (approximately) what msgs Alice already has, but without spending so many bytes. It's a bandwidth-cheap way of approximately knowing what msgs Alice has. This way Bob can quite confidently know what msgs she is missing.
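For intuition, here's a toy bloom filter. It's deliberately simplified (real implementations size the bit array and pick hash functions for a target false-positive rate), but it shows the key property: added items always answer "maybe have", absent items usually answer "don't have", and the whole thing costs a fixed, small number of bytes regardless of how verbose the msg IDs are.

```javascript
// Toy bloom filter: approximate set membership in few bytes.
// False positives are possible; false negatives are not.
class BloomFilter {
  constructor(bits = 256, hashes = 3) {
    this.bits = bits
    this.hashes = hashes
    this.array = new Uint8Array(Math.ceil(bits / 8))
  }
  // simple FNV-1a-style string hash, seeded per hash function
  hash(item, seed) {
    let h = 2166136261 ^ seed
    for (const ch of item) {
      h ^= ch.charCodeAt(0)
      h = Math.imul(h, 16777619)
    }
    return (h >>> 0) % this.bits
  }
  add(item) {
    for (let s = 0; s < this.hashes; s++) {
      const i = this.hash(item, s)
      this.array[i >> 3] |= 1 << (i & 7)
    }
  }
  mightHave(item) {
    for (let s = 0; s < this.hashes; s++) {
      const i = this.hash(item, s)
      if (!(this.array[i >> 3] & (1 << (i & 7)))) return false
    }
    return true
  }
}

const bf = new BloomFilter()
bf.add('msgId1')
bf.add('msgId2')
console.log(bf.mightHave('msgId1')) // true (added items always answer true)
console.log(new BloomFilter().mightHave('msgId1')) // false (empty filter)
```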

Continuing the protocol:

```mermaid
sequenceDiagram
    participant A as Alice
    participant B as Bob
    Note over A: aliceBF0 := bloomFor(T, 0, aliceWant)
    A->>B: Phase 3: Send T, aliceWant and aliceBF0
    Note over B: aliceMiss0 := msgsMissing(T, 0, aliceWant, aliceBF0)
    Note over B: bobBF0 := bloomFor(T, 0, bobWant)
    B->>A: Phase 4: Send T, bobBF0 and aliceMiss0
```

aliceMiss0 is now a list of msg IDs that Bob discovered Alice is missing, based on the "set" of msg IDs she already has (aliceBF0). The protocol continues with a few more rounds of this. You can see more details in the protospec; this is enough of a recap.
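Bob's side of that computation could be sketched like this. The shapes are hypothetical, and for clarity the sketch consults an exact Set where the real protocol consults the received bloom filter (same membership question, just approximate in the real thing):

```javascript
// Sketch of msgsMissing: Bob lists the msg IDs inside Alice's
// wantRange that her filter doesn't claim to contain
function msgsMissing(tangleMsgs, wantRange, peerFilter) {
  const [minDepth, maxDepth] = wantRange
  return tangleMsgs
    .filter((m) => m.depth >= minDepth && m.depth <= maxDepth)
    .filter((m) => !peerFilter.has(m.id))
    .map((m) => m.id)
}

const bobTangle = [
  { id: 'm1', depth: 1 },
  { id: 'm2', depth: 2 },
  { id: 'm3', depth: 3 },
]
const aliceBF0 = new Set(['m1']) // stand-in for Alice's bloom filter
console.log(msgsMissing(bobTangle, [1, 3], aliceBF0)) // [ 'm2', 'm3' ]
```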

Bloom filter to include account tangle

:bulb:

Here's the tweak: when Alice is informing Bob of the msgs she already has in this tangle (aliceBF0) she can also include the account tangle msg IDs associated with those (feed) tangle msgs. This way, when Bob runs msgsMissing, he can figure out what account tangle msgs she is missing.

At the end of the 3 rounds of bloom filter exchanges, each side will know which account tangle msgs the other side is missing. Then they send those msgs (both feed and account) to each other. When receiving these msgs, you can sort them so that account tangle msgs come first (if they didn't already arrive sorted), and then database-persist (thus validating) the account tangle msgs first, so that by the time you insert the feed msgs, the account tangle msgs are already available to assist msg.metadata.account validation.
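The receive-side ordering is just a stable partition, sketched below (the `isAccountMsg` field is a hypothetical stand-in for however an implementation tells the two tangle types apart):

```javascript
// Sketch: persist account tangle msgs before the feed msgs that
// depend on them, keeping each group in its arrival order
function sortAccountFirst(msgs) {
  // Array.prototype.sort is stable, so relative order within
  // each group is preserved
  return [...msgs].sort(
    (a, b) => Number(b.isAccountMsg) - Number(a.isAccountMsg)
  )
}

const received = [
  { id: 'feed1', isAccountMsg: false },
  { id: 'acct1', isAccountMsg: true },
  { id: 'feed2', isAccountMsg: false },
]
console.log(sortAccountFirst(received).map((m) => m.id))
// [ 'acct1', 'feed1', 'feed2' ]
```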

Seems to me like it could work!

staltz commented 7 months ago

Good news! I finished implementing this idea in tangle sync, and I think it's working!

staltz commented 7 months ago

To me this is a closed issue. :)