ssbc / ssb-db2

A new database for secure-scuttlebutt
47 stars 8 forks source link

`addOOO` should not update Base index #395

Closed staltz closed 1 year ago

staltz commented 1 year ago

In general if we do add ooo messages then it should not update this index.

From https://github.com/ssbc/ssb-ebt/pull/72#issuecomment-1274832766

Same should happen with addTransaction because it uses addOOO under the hood, I suppose.

arj03 commented 1 year ago

I have been thinking about this and here is a dump of my notes.

We need to avoid duplicates in the database. Base index takes care of EBT for us. Index replication needs to do a check as well. And then we have addooo in db2 that checks the database for duplicates before inserting. So far so good, sadly there is another really tricky case with base index and EBT where we switch from index replication to full replication. For this to work to we need some way for know that a feed is in a partial state and delete those partial messages before doing normal full replication. Otherwise we will also have duplicates. So of this smells like a ton of corner cases.

In db1 there was only 1 type of ooo, In a thread you could fetch messages from outside your hops. And those would be stored in a separate db. The messages in this db was never part of any indexes. Iirc you just checked in the db given the msg key if you were missing a message.

Another way to solve this whole thing is to store these ooo messages in a separate log. This log could have jitdb indexes. I'm not entirely sure what the best API would be, if we need it in db2 where it checks in the main db first and then falls back the ooo db afterwards. Or of we say that these ooo messages are used in very specific places: metafeeds, about, threads(maybe?) That those modules could do the extra query. In any case it is not really a bit problem that we have messages in both databases. And this would make things a lot easier to reason about. Secondly an index like base would never know about this ooo db.

staltz commented 1 year ago

Hmm, interesting solution. I didn't know how OOO worked in db1.

Could two logs just complicate things further? For db.query() we would have to fetch messages from two different places, using two different sets of indexes. And it doesn't really solve the problem of duplication because when switching from index replication to full replication, we would still have the message on the OOO log and on the normal log. Who knows how complicated compaction would become...

What if...

For this to work to we need some way for know that a feed is in a partial state

We just store this information somewhere? Like, remember that in buttwoo, the feedId isn't necessarily kvt.value.author? In db2, in those specific cases we are storing KVTF objects ({key, value, timestamp, feed}) instead of KVTs, so that the base index knows how to do the accounting for feedId.

We could do something similar for OOO cases, we could store (for instance) {key, value, timestamp, ooo: true} objects. Then, when loading the base index into the state object, we can skip KVTOOO objects. And when indexing the base index, we can seek for the ooo field and skip the record. Or we could insert a special value in the base index that flags that this feedId is in OOO mode. Then, we can reliably know whether a feed is in "partial state" or not.

This would solve (I think?) the case for switching between full replication => indexed replication, and indexed replication => full replication. It wouldn't (I think?) allow doing OOO to fetch messages missing a thread, because every feedId would either be "replicated in full" or "replicated OOO" (lets call them "modes"), with no gray area in between the modes. But we can allow for that use case by first deleting a feedId before changing the mode. We have compaction to help us out for those cases.

arj03 commented 1 year ago

I wanted to make sure we explored all options here, as this can be a bit tricky to support properly. Thanks for walking through the options with me 🙂

Could two logs just complicate things further?

Possible, i think for what we have now it should be ok. But if we do friends using indexes then it gets complicated.

{key, value, timestamp, ooo: true}

That is a great idea. Then we can have the base index in charge of it a feed is ooo or not, and have replication scheduler use that. If it removes the ooo flag on seeing a msg with seq 1 then i think the "switch from partial to full" should be crash safe.

staltz commented 1 year ago

If it removes the ooo flag on seeing a msg with seq 1 then i think the "switch from partial to full" should be crash safe.

Nice optimization.

A new question: what was the reason why we can't messages stored in the wrong order? Suppose I'm replicating Alice's messages out of order, msg5, msg1, msg3. Then, I decide to switch to full replication, and I can just replicate msg2, msg4, and add them to the log, so I end up with this order: msg5, msg1, msg3, msg2, msg4. Then, when we call getAtSequence, ssb-db2 just does the leveldb lookup for [aliceId, sequence], it spits out an offset, and then I pass the msg at that offset to the remote EBT peer.

Would we have to delete Alice's OOO messages and refetch them "in order"?

arj03 commented 1 year ago

Well you could just fetch what you need, but then EBT needs to know what you have or do that another way. Much easier to delete and use the EBT hammer 🙂

staltz commented 1 year ago

Well you could just fetch what you need, but then EBT needs to know what you have or do that another way. Much easier to delete and use the EBT hammer slightly_smiling_face

Yeah, before I saw your reply I was typing this:

Then, I decide to switch to full replication, and I can just replicate msg2, msg4

I'm thinking how this would play out in practice. I assume that when we switch to full replication for Alice's feed, that just means doing normal classic EBT format replication. But we could change classic EBT format such that it checks base index, and if base index says that alice is in OOO mode, then we can just tell the remote EBT peer that we want to replicate all of alice's feed from seq 1 onwards, except when we get the messages, we just check with getAtSequence whether we have it already or not, and skip adding the ones we already have (maybe just removing the .ooo field from them).

I'm not sure what this would look like in terms of performance.

arj03 commented 1 year ago

Based on the index benchmark probably around 1/4 the speed. I was also concerned about the validation in db2 of these messages. Some you would have to add as addooo.

arj03 commented 1 year ago

I can take this one, the task seems to be well specified