staltz closed this issue 1 year ago
Maybe something happens on peer close + restart? I think those tests keep all the peers alive and maybe that's different from what I had? I'm not sure, all I know is that I connected to ~6 different room members and nothing was replicating for me at all.
Clearing the state on disk (on one peer or both peers, doesn't matter) didn't really help. What helps is restarting both peers. It's weird: Alice recover-replicates herself until sequence N, and then it gets stuck. So I restart both Alice and Bob, and then Alice recover-replicates herself until sequence N+X and gets stuck again. Restart, then N+X+Y, and so forth.
Indeed weird. At least it sounds like it should be possible to replicate.
> Alice recover-replicates herself until sequence N
Can you describe more in detail exactly what the steps are?
One possible candidate culprit for the replicate-then-stuck behavior: CPU throttling with the too-hot libraries.
> Can you describe more in detail exactly what the steps are?
Same as the original post up there (I did that accident again). Manyverse Android 0.2110.26 (Alice) recovering its feed from Manyverse Desktop 0.2110.22 (Bob) via LAN connection.
Ok, so in your example Bob actually has N+X+Y, it just doesn't give more than N in the first go?
Yes, Bob should have all the known messages from Alice, let's say the total is N+X+Y+Z. I remember that the first time I filed this issue, I also had to restart many times.
I see. Then it makes a lot more sense why we haven't seen it in tests, I think. I can have a look at trying to reproduce the error unless you want to?
I should specify that both peers are using ssb-ebt 7.0.4 and epidemic-broadcast-trees 8.0.4, not the latest ssb-ebt 8.0.1 and epidemic-broadcast-trees 9.0.0.
> I can have a look at trying to reproduce the error unless you want to?
I wouldn't know where to begin, so it's great if you can try. I can also help out pair programming or something.
Note to self: maybe I should just update ssb-ebt on both peers (keeping ssb-replication-scheduler at the old version) and see what happens. :smiley:
Okay, I'm using the ssb-ebt 8.0.1 in both peers Alice @+UMK and Bob @QlCT. @QlCT had the old ~/.ssb/ebt/@-UMK file with contents
"@+UMKhpbzXAII+2/7ZlsgkJwIsxdfeFi36Z5Rk1gCfY0=.ed25519": 0,
When @+UMK connected to @QlCT over LAN, nothing happened for @+UMK. No data got replicated, I refreshed a couple of times, and nothing, really. Then after a while (3 min?) it started replicating. But I don't know if it was replicating with @QlCT or some other room peer.
Hmm, I must say that it now seems to be replicating everything without needing restarts. I'll put ssb-ebt 8.0.1 in production in Manyverse. Let's re-evaluate this issue if it still gets bug reports after that. For now I'm fine if we park this issue. :)
I nuked my phone's installation again (this time on purpose) to test whether the recent changes to ssb-db2 would fix this issue. Apparently not. @+UMK was connected to @QlCT but @+UMK didn't get any new data. I also tried removing the ebt file from @QlCT's side and that didn't have any effect either. I ended up connecting to a bunch of other random peers and one of them started giving me data.
Note: ssb-ebt was at 8.0.1
Huh, I am seeing the same root problem, problems recovering from zero, but with different details.
I am trying to get Planetary/go-ssb to restore an empty profile by syncing with ssb-server. I'm running ssbc/ssb-server commit 919cbdd (head of main), which uses ssb-ebt 5.6.7. If I turn off EBTs and just rely on legacy gossip the whole feed is restored. With EBT replication enabled it seems like ssb-server does not want to send me my own feed. I noticed that the file ~/.ssb/ebt/{my_feed_id} contains my feed ID and a sequence number greater than 0. Deleting this file doesn't help. Somehow it gets repopulated with the same sequence number that was there before. Where could this sequence number be coming from? Editing the file to set the sequence number for {my_feed_id} to 0 results in my client fetching exactly 1 message before getting stuck. Restarting both peers does not cause my client to make any more progress.
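The manual edit described above could be scripted roughly as follows. This is only a sketch: it assumes the persisted file is a JSON object mapping feed IDs to sequence numbers (which the snippet quoted earlier in the thread suggests but does not confirm), and `resetPersistedSeq`, the demo file name, and the demo feed ID are all hypothetical. It runs against a throwaway temp file, not a real `~/.ssb/ebt` directory.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: set the persisted sequence for one feed to 0.
// ASSUMPTION: the file is plain JSON of the form { "<feedId>": <sequence>, ... }.
function resetPersistedSeq(filePath, feedId) {
  const clock = JSON.parse(fs.readFileSync(filePath, 'utf8'));
  const previous = clock[feedId];
  clock[feedId] = 0;
  fs.writeFileSync(filePath, JSON.stringify(clock, null, 2));
  return previous; // returned so the caller can log what was overwritten
}

// Demo against a temporary file with made-up contents.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'ebt-demo-'));
const demoFile = path.join(demoDir, 'clock.json');
const demoFeedId = '@exampleFeedId.ed25519'; // placeholder, not a real feed ID
fs.writeFileSync(demoFile, JSON.stringify({ [demoFeedId]: 42 }));

const previousSeq = resetPersistedSeq(demoFile, demoFeedId);
const updated = JSON.parse(fs.readFileSync(demoFile, 'utf8'));
console.log('previous:', previousSeq, 'now:', updated[demoFeedId]);
```

Since the file reportedly gets repopulated after deletion, a script like this would presumably only be meaningful while the peer process is stopped.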
Oh, this issue should be closed because it has long since been solved. I've tested this use case (restoring a feed from scratch) numerous times, and the latest versions of ssb-ebt don't have this problem. It seems your issue is with ssb-ebt 5.6.7, which needs to be updated. The latest is ssb-ebt 9.1.0, but ssb-ebt 8.x is also pretty solid (we use it in Manyverse).
I tested with ssb-ebt 8.2.1 and 9.1.0 but both exhibit the same behavior I am seeing on 5.6.7. I also deleted ~/.ssb/ebt for good measure. The newer versions do seem to have received my notes successfully, as I can see my own feed ID now has a 0 next to it in ~/.ssb/ebt/{my_feed_id}.
@staltz do you know which commit or PR fixed this issue for you?
I can't at the moment get this information for you because it's Friday evening in my timezone already. Maybe you can figure it out from ssb-ebt commits, epidemic-broadcast-trees commits, and ssb-replication-scheduler. In fact I think ssb-r-s has an integration test specifically covering this issue.
I will try to read through the commit history at some point, but in the meantime I think there should be an open issue for this somewhere. I'm happy to file a new one here or somewhere else, or we can reopen this one. The issue is easily reproducible with version 16.0.1 of ssb-server, so maybe I should file a ticket in that repo?
Perhaps you could start by giving reproduction details and steps so the issue can be reliably experienced on other computers.
Here is the way I have been reproducing it:
I realize these steps aren't generic enough, so @mixmix is helping me write a test for this in JS.
As arj mentioned it looks like there already is such a test in https://github.com/ssbc/epidemic-broadcast-trees/blob/bfcd2cab26ecc2a80875dfdcb75fb2e200ce6175/test/lost-state.js#L64 but when I run the tests in that repo it doesn't seem to execute.
ssb-server 16.0.1 is using ssb-ebt 5.x, which is too old. As I said before, this issue should be resolved in ssb-ebt >=8. Running Planetary as reproduction steps isn't the best because that means that go-ssb's ebt could also be the cause of this bug. I need reproduction steps for an ssb JS peer running ssb-ebt >=8 connecting to another ssb JS peer running ssb-ebt >=8.
Suppose Alice and Bob are friends and have published a lot of content. Both have ssb-ebt installed, and thus there exists ~/.ssb/ebt/aliceID on Bob's computer and ~/.ssb/ebt/bobID on Alice's computer.
If Alice loses her installation and starts from scratch, recovering her secret and then reconnecting with Bob, it seems like Bob does not give Alice her feed messages via EBT, when it should. Maybe because of the state persisted on Bob's side, Bob figures that Alice already has all that she needs, when in reality Alice has nothing.
I tested this with a fair amount of confidence (not 100%) by attempting to recover my feed and connecting to N room members. None of the connections gave me my data back.
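To make the suspected failure mode concrete, here is a toy model of it. It is not the ssb-ebt or epidemic-broadcast-trees implementation; `messagesToSend`, the clock objects, and the five-message log are all made up for illustration. It only contrasts Bob trusting his stale persisted state with Bob using the sequence Alice reports after her reinstall.

```javascript
// Toy model only: NOT how ssb-ebt actually decides what to send.
// All names and data below are hypothetical.

// Return every stored message newer than the sequence the requester is believed to have.
function messagesToSend(log, knownSeq) {
  return log.filter((msg) => msg.sequence > knownSeq);
}

// Bob holds a copy of Alice's feed, sequences 1 through 5.
const bobCopyOfAlice = [1, 2, 3, 4, 5].map((sequence) => ({ author: 'alice', sequence }));

// What Bob persisted about Alice before she lost her installation.
const bobPersistedClock = { alice: 5 };

// What Alice reports after starting from scratch: she has nothing.
const aliceFreshClock = { alice: 0 };

// Suspected behavior: Bob relies on his stale persisted state and sends nothing.
const suspected = messagesToSend(bobCopyOfAlice, bobPersistedClock.alice);

// Desired behavior: Bob uses the sequence Alice reports now and sends the whole feed.
const desired = messagesToSend(bobCopyOfAlice, aliceFreshClock.alice);

console.log('suspected:', suspected.length, 'desired:', desired.length);
```

If the hypothesis is right, the fix would amount to preferring the freshly reported sequence over the persisted one whenever the two disagree in this direction.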