ssbc / ssb-ebt

secure scuttlebutt replication with epidemic-broadcast-trees
MIT License

Problems when peer is recovering their feed from zero #59

Closed staltz closed 1 year ago

staltz commented 2 years ago

Suppose Alice and Bob are friends and have published a lot of content. Both have ssb-ebt installed, and thus there exists ~/.ssb/ebt/aliceID on Bob's computer and ~/.ssb/ebt/bobID on Alice's computer.

If Alice loses her installation and starts from scratch, recovering her secret and then reconnecting with Bob, it seems like Bob does not give Alice her feed messages via EBT, when it should. Maybe because of the state persisted on Bob's side, Bob figures that Alice already has all that she needs, when in reality Alice has nothing.
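
The suspected failure mode can be illustrated with a toy model. This is not how ssb-ebt or epidemic-broadcast-trees is actually implemented; `messagesToSend` and the variables below are hypothetical names used only for illustration. The sketch assumes a sender decides what to transmit by comparing its own latest sequence for a feed against the sequence it believes the remote peer already has.

```javascript
// Toy model only: NOT the ssb-ebt implementation.
// Returns the sequence numbers a sender would transmit for one feed,
// given what it stores locally and what it believes the remote has.
function messagesToSend(localSeq, believedRemoteSeq) {
  const seqs = [];
  for (let seq = believedRemoteSeq + 1; seq <= localSeq; seq++) {
    seqs.push(seq);
  }
  return seqs;
}

// Assumption for the example: Bob stores 5 of Alice's messages.
const bobHasFromAlice = 5;

// Stale belief persisted from before Alice lost her data.
const staleBelief = 5;
// What Alice actually has after starting from scratch.
const actualAliceSeq = 0;

console.log(messagesToSend(bobHasFromAlice, staleBelief));    // []
console.log(messagesToSend(bobHasFromAlice, actualAliceSeq)); // [ 1, 2, 3, 4, 5 ]
```

If Bob acts on the stale belief, he sends nothing, which would match the symptom described here.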

I tested this with a fair amount of confidence (not 100%) by attempting to recover my feed and connecting to N room members. None of the connections gave me my data back.

arj03 commented 2 years ago

Strange. There should be a test for that in EBT.

staltz commented 2 years ago

Maybe something happens on peer close + restart? I think those tests keep all the peers alive and maybe that's different from what I had? I'm not sure, all I know is that I connected to ~6 different room members and nothing was replicating for me at all.

staltz commented 2 years ago

Clearing the state on disk (on one peer or both peers, doesn't matter) didn't really help. What helps is restarting both peers. It's weird: Alice recover-replicates herself until sequence N, and then it gets stuck. So I restart both Alice and Bob, and then Alice recover-replicates herself until sequence N+X and gets stuck again. Restart, then N+X+Y, and so forth.

arj03 commented 2 years ago

Indeed weird. At least it sounds like it should be possible to replicate.

> Alice recover-replicates herself until sequence N

Can you describe more in detail exactly what the steps are?

staltz commented 2 years ago

One possible candidate culprit for the replicate-then-stuck behavior: CPU throttling with the too-hot libraries.

> Can you describe more in detail exactly what the steps are?

Same as the original post up there (I accidentally did that again). Manyverse Android 0.2110.26 (Alice) recovering its feed from Manyverse Desktop 0.2110.22 (Bob) via LAN connection.

arj03 commented 2 years ago

Ok, so in your example Bob actually has N+X+Y, it just doesn't give more than N in the first go?

staltz commented 2 years ago

Yes, Bob should have all the known messages from Alice, let's say the total is N+X+Y+Z. I remember that the first time I filed this issue, I also had to restart many times.

arj03 commented 2 years ago

I see. Then it makes a lot more sense why we haven't seen it in tests, I think. I can have a look at trying to reproduce the error unless you want to?

staltz commented 2 years ago

I should specify that both peers are using ssb-ebt 7.0.4 and epidemic-broadcast-trees 8.0.4, not the latest ssb-ebt 8.0.1 and epidemic-broadcast-trees 9.0.0.

staltz commented 2 years ago

> I can have a look at trying to reproduce the error unless you want to?

I wouldn't know where to begin, so it's great if you can try. I can also help out pair programming or something.

staltz commented 2 years ago

Note to self: maybe I should just update ssb-ebt on both peers (keeping ssb-replication-scheduler at the old version) and see what happens. :smiley:

staltz commented 2 years ago

Okay, I'm using ssb-ebt 8.0.1 on both peers, Alice @+UMK and Bob @QlCT. @QlCT had the old ~/.ssb/ebt/@-UMK file with contents

 "@+UMKhpbzXAII+2/7ZlsgkJwIsxdfeFi36Z5Rk1gCfY0=.ed25519": 0,
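
For checking such a file by hand, a minimal sketch follows. It assumes the persisted state is a JSON object mapping feed IDs to sequence numbers, which the fragment above suggests but this thread does not confirm; `readSeq` is a hypothetical helper, not part of ssb-ebt.

```javascript
// Assumption: the persisted state is a JSON object of { feedId: sequence }.
// Returns the stored sequence for feedId, or null if the feed is absent.
function readSeq(clockJson, feedId) {
  const clock = JSON.parse(clockJson);
  return Object.prototype.hasOwnProperty.call(clock, feedId)
    ? clock[feedId]
    : null;
}

const alice = '@+UMKhpbzXAII+2/7ZlsgkJwIsxdfeFi36Z5Rk1gCfY0=.ed25519';
// Stand-in for the file contents; a real check would read the file from disk.
const contents = JSON.stringify({ [alice]: 0 });

console.log(readSeq(contents, alice));            // 0
console.log(readSeq(contents, '@someone-else'));  // null
```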

When @+UMK connected to @QlCT over LAN, nothing happened for @+UMK. No data got replicated, I refreshed a couple of times, and nothing, really. Then after a while (3 min?) it started replicating. But I don't know if it was replicating with @QlCT or some other room peer.

staltz commented 2 years ago

Hmm, I must say that it seems like it's now replicating everything without needing restarts. I'll put ssb-ebt 8.0.1 in production in Manyverse. Let's re-evaluate this issue if it still gets bug reports after that. For now I'm fine if we park this issue. :)

staltz commented 2 years ago

I nuked my phone's installation again (this time on purpose) to test whether the recent changes to ssb-db2 would fix this issue. Apparently not. @+UMK was connected to @QlCT but @+UMK didn't get any new data. I also tried removing the ebt file from @QlCT's side, and that didn't have any effect either. I ended up connecting to a bunch of other random peers and one of them started giving me data.

Note: ssb-ebt was at 8.0.1

mplorentz commented 1 year ago

Huh, I am seeing the same root problem, problems recovering from zero, but with different details.

I am trying to get Planetary/go-ssb to restore an empty profile by syncing with ssb-server. I'm running ssbc/ssb-server commit 919cbdd (head of main), which uses ssb-ebt 5.6.7. If I turn off EBT and just rely on legacy gossip, the whole feed is restored. With EBT replication enabled, it seems like ssb-server does not want to send me my own feed. I noticed that the file ~/.ssb/ebt/{my_feed_id} contains my feed ID and a sequence number greater than 0. Deleting this file doesn't help. Somehow it gets repopulated with the same sequence number that was there before. Where could this sequence number be coming from? Editing the file to set the sequence number for {my_feed_id} to 0 results in my client fetching exactly 1 message before getting stuck. Restarting both peers does not cause my client to make any more progress.

staltz commented 1 year ago

Oh, this issue should be closed because it has long since been solved. I've tested this use case (restoring a feed from scratch) numerous times, and the latest versions of ssb-ebt don't have this problem. It seems your issue is with ssb-ebt 5.6.7, which needs to be updated. The latest is ssb-ebt 9.1.0, but ssb-ebt 8.x is also pretty solid (we use it in Manyverse).

mplorentz commented 1 year ago

I tested with ssb-ebt 8.2.1 and 9.1.0 but both exhibit the same behavior I am seeing on 5.6.7. I also deleted ~/.ssb/ebt for good measure. The newer versions do seem to have received my notes successfully, as I can see my own feed ID now has a 0 next to it in ~/.ssb/ebt/{my_feed_id}.

@staltz do you know which commit or PR fixed this issue for you?

staltz commented 1 year ago

I can't at the moment get this information for you because it's Friday evening in my timezone already. Maybe you can figure it out from ssb-ebt commits, epidemic-broadcast-trees commits, and ssb-replication-scheduler. In fact I think ssb-r-s has an integration test specifically covering this issue.

mplorentz commented 1 year ago

I will try to read through the commit history at some point, but in the meantime I think there should be an open issue for this somewhere. I'm happy to file a new one here or somewhere else, or we can reopen this one. The issue is easily reproducible with version 16.0.1 of ssb-server, so maybe I should file a ticket in that repo?

staltz commented 1 year ago

Perhaps you could start by giving reproduction details and steps so the issue can be reliably experienced on other computers.

mplorentz commented 1 year ago

Here is the way I have been reproducing it:

  1. Run a pub using ssb-server 15.0.3 or 16.0.1
  2. Run Planetary and join the pub
  3. Verify that the pub has replicated the Planetary feed
  4. Delete the database from Planetary and relaunch

Expected: Planetary connects to the pub and replicates its own feed.

Actual: Planetary connects to the pub and sends EBT notes asking for its own feed, but the pub never sends it.

I realize these steps aren't generic enough, so @mixmix is helping me write a test for this in JS.

As arj mentioned, it looks like there already is such a test in https://github.com/ssbc/epidemic-broadcast-trees/blob/bfcd2cab26ecc2a80875dfdcb75fb2e200ce6175/test/lost-state.js#L64 but when I run the tests in that repo it doesn't seem to execute.

staltz commented 1 year ago

ssb-server 16.0.1 is using ssb-ebt 5.x, which is too old. As I said before, this issue should be resolved in ssb-ebt >=8. Running Planetary as reproduction steps isn't the best because that means that go-ssb's ebt could also be the cause of this bug. I need reproduction steps for an ssb JS peer running ssb-ebt >=8 connecting to another ssb JS peer running ssb-ebt >=8.
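
A quick way to rule out an old ssb-ebt on either JS peer is to check the major version before running a reproduction. The sketch below is only an illustration: `isRecentEbt` is a hypothetical helper, and the threshold of 8 comes from the ">=8" requirement stated above.

```javascript
// Hypothetical helper: true when a version string such as "8.0.1"
// has a major version of 8 or higher.
function isRecentEbt(version) {
  const major = Number.parseInt(String(version).split('.')[0], 10);
  return Number.isInteger(major) && major >= 8;
}

console.log(isRecentEbt('5.6.7')); // false
console.log(isRecentEbt('8.2.1')); // true
console.log(isRecentEbt('9.1.0')); // true
```

The version string would typically be taken from the `version` field of the installed package's package.json on each peer.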