ssbc / ssb-ebt

secure scuttlebutt replication with epidemic-broadcast-trees
MIT License
18 stars 10 forks source link

Recycling ooo messages when replicating with EBT #80

Open staltz opened 1 year ago

staltz commented 1 year ago

Context

With partial replication, you may have added out-of-order (OOO) messages belonging to someone's main feed, such as metafeed/announce or indexed messages.

However, if you switch from partial replication to full replication (e.g. if you follow the person and begin requesting their full main feed), then we will have to avoid duplicate messages, so we probably have to delete all the OOO messages for that main feed, and then redownload them in-order as part of the full feed.

Problem

In practice, if there are several OOO messages (if not for one feed, then at least this problem becomes clear when there are hundreds of feeds switching from partial replication to full replication), then we are deleting data from disk just to immediately request that same data from remote peers. This can be wasteful in bandwidth costs, and in overall replication performance.

Potential solutions

Instead of first deleting the OOO messages, we could just start replicating the full main feed. Then when we would fetch msg sequence 8 (which is, say, already stored OOO), we would move that message in the log, by first deleting it and then appending it to the log. Then proceed to requesting message 9 and so forth. If you stumble upon another OOO msg, say at sequence 15, then move it in the log too instead of requesting.

Potential problem: I don't know exactly how EBT does this but it seems to me like it doesn't request one sequence num at a time, instead it just tells remote peers what is the latest sequence that it has, and then those remote peers flood me with answers.

Potential solution to that potential problem: if that's the case, perhaps we could tell remote peers that my latest sequence for that full feed is 7 and then once I get all of that up to 7, then I (offline) recycle my OOO msg sequence 8 by moving it in the log, and then I come back to the EBT network and tell remote peers that my new latest sequence is 14 and that my current position is 8.

Hmmm, but it seems like in EBT, you don't get to say a range of sequence numbers, you can only inform the lower bound for sequence numbers. You can say "give me everything after sequence X" but you can't say "give me everything between sequence X and sequence Y".

So maybe we could change EBT such that it supports these ranges? That would then constitute a good solution also for #79 .

cc @arj03

staltz commented 1 year ago

Wow, I just had a thought:

Currently EBT supports specifying a lower bound, i.e. the sequence number for the latest msg you have for a feed. This tells peers that you have everything up until that sequence, and that you're only requesting for msgs ABOVE it, thus lower bound. But what if you're lying? Then you have a way of doing sliced replication. (Of course this assumes you are only leeching, not seeding. If you declare you have msgs up until that sequence number, but you can't seed any of them, then peers will know you're lying or you're selfish)

Now that is the current situation. If we add an upper bound, then you can request msgs from a range, as a leecher. But as a seeder, this means that you refuse to give msgs larger than a certain sequence number.

Overall having both lower bound and upper bound suddenly give us sliced replication, and this is ... very interesting, because the range slice can be positioned anywhere in the feed. E.g. for OOO recycling, you first slice replicate all the non-OOO ranges you want, and after you receive everything, you can switch to fully replicating the feed.

staltz commented 1 year ago

Yet another comment (different enough that it's okay to start a new one):

If you are leeching, then the latest sequence number you declare is actually a lower bound for what msgs you want to leech. BUT from the perspective of seeding, that same sequence number is an upper bound, for what msgs you can provide.

If we introduce the other bound, let's call it "limit sequence", such that latest < limit, then as a leecher my "limit" is an upper bound because I don't want to get msgs after that. But (and this is where it gets interesting) as a seeder, it is also an upper bound because latest < limit. But because latest is a tighter upper bound, then it just means that when seeding, limit is pointless and we can just conclude that you have all messages up until latest.

If you're doing sliced replication, it could be that you're lying and you don't have all messages up until latest, but it doesn't matter much because (if my understanding of EBT is correct) you're the one who uploads msgs for other peers' range requests, and if have missing messages then you can just refuse to upload them by staying silent.

arj03 commented 1 year ago

I'm a bit sceptical mostly because this can be tricky switching between the different modes, moving messages around etc. I wonder if it's possible to quickly switch mode when enough messages have been received. Then at least this could be a local changes instead of a new api. I'm up for trying testing this.

staltz commented 1 year ago

Yeah, and a note: this issue is low priority because it's about bandwidth optimization. Implementing or experimenting with this can happen after things are working.

However, a functional requirement of private groups may require #79, which is similar to #80, but doesn't have "modes".