ssbc / ssb-ebt

secure scuttlebutt replication with epidemic-broadcast-trees
MIT License
18 stars 10 forks source link

Not replicating actual friends #61

Closed staltz closed 3 years ago

staltz commented 3 years ago

I'm getting a lot of bug reports in Manyverse that things are not replicating between two peers currently connected and mutually following. Finally I managed to reproduce it myself, and here's one thing I found:

File ~/.ssb/ebt/@QlCT on @+UMK looked like this:

{
  "@QlCTpvY7p9ty2yOFrv1WU1AE88aoQc4Y7wYal7PFc+w=.ed25519": 8601,
  "@+UMKhpbzXAII+2/7ZlsgkJwIsxdfeFi36Z5Rk1gCfY0=.ed25519": 5309,
  ...
}

File ~/.ssb/ebt/@-UMK on @QlCT looked like this:

{
  "@+UMKhpbzXAII+2/7ZlsgkJwIsxdfeFi36Z5Rk1gCfY0=.ed25519": 5309,
  "@QlCTpvY7p9ty2yOFrv1WU1AE88aoQc4Y7wYal7PFc+w=.ed25519": -1,
  ...
}

As you can see, that -1 seems problematic. It means that QlCT is forbidding +UMK to get new updates, perhaps mistaking it as some disallowed peer. Could this be some initialization bug?

cc @arj03

staltz commented 3 years ago

Some users said that Manyverse v0.2108.2 was working fine, and that one had

Here are the diffs between

staltz commented 3 years ago

Learned some really strange things about this bug!

I kept two terminal tabs open, one that was constantly monitoring the contents of the ebt file on the phone, and the other was constantly monitoring the ebt file on the desktop. Once, the @-UMK file on the desktop showed 8592 instead of -1! Then I disconnected the peers (by shutting down the mobile peer), and the desktop went back to showing the number -1 in the file.

Note that the number went to 8592 but never to 8601 which would have been the correct "latest" content. And also the above experiment didn't happen consistently. I started both peers again and again and only SoMeTiMeS did it do the sudden jump to 8592. Most of the times it stayed on -1.

A couple of times, one of the files (either the mobile or the desktop) would suddenly become all blank, like the whole file was just the JSON {}. Then, sometime later, randomly, it would go back to usual.

staltz commented 3 years ago

Finally, I took the nuclear option of deleting the EBT files @-UMK on desktop and @QlCT on mobile, and after a while, the files were re-created with 8592 coming back (not the expected 8601), and after a while (~2min) the number -1 came back.

arj03 commented 3 years ago

I can have a look tonight. My hunch is that it is something related to initialization. Where notes is called before things are properly set up. Resulting in -1.

staltz commented 3 years ago

Thanks! :pray:

Where notes is called before things are properly set up. Resulting in -1.

What could be causing the lack of replication even though the number is 8592? (less than 8601)

staltz commented 3 years ago

I now reverted ssb-ebt to 7.0.3 and epidemic-broadcast-trees to 8.0.4 and I still see the exact same behavior (and re-starting the peers over and over again many times). So at least we can rule out new changes to code. I also removed the ebt files, just to make sure, and same behavior.

To make sure we're not dealing with a feed fork, I ended up checking what is the message at 8592 by QlCT, and it's a private message. I looked at the JSON of that message on both desktop and mobile and they are identical except for the KVT timestamp.

mixmix commented 3 years ago

I want to lend energy to this quest. Please let me know if having a person pairing on this would help (sometimes a small party is good when hunting down an elusive and dangerous beast)

staltz commented 3 years ago

Yeah it would be awesome to pair. I don't know a lot of the inner details of epidemic-broadcast-trees but it's super useful that I can reproduce it on two peers and tweak code for both quickly.

arj03 commented 3 years ago

Indeed. I'll try and prepare a debugging branch of both SSB-EBT and EBT that should give us an idea of what is going on.

arj03 commented 3 years ago

Can you try and run desktop and phone with these 2 branches and report what output you get?

https://github.com/ssbc/ssb-ebt/pull/62 https://github.com/ssbc/epidemic-broadcast-trees/pull/55

arj03 commented 3 years ago

As for pairing, agree it would be great to pair on this issue. Especially since the bug can be reproduced. I'm good with remote debugging like this. EBT is a state beast and the code is really hard to read. Itching to tidy up the code without making big changes just to make it easier to understand.

Edit: while reading some of the code I decided to do a tiny refactor of things: https://github.com/ssbc/epidemic-broadcast-trees/pull/56

staltz commented 3 years ago

I'll try tomorrow for sure! :)

mixmix commented 3 years ago

Does it have to be phone - phone or can we reproduce it locally with scuttle-testbot and it's connect function. Maybe hook some methods to watch the things trigger

arj03 commented 3 years ago

A few things we discovered:

arj03 commented 3 years ago

https://github.com/ssbc/epidemic-broadcast-trees/pull/57

staltz commented 3 years ago

On top of what arj already mentioned, there was also a bug in ssb-validate2 where a fork was detected, but as a false positive.

I discovered what the problem there is: the validation code is recreating the key for a private message using ssbKeys.hash(JSON.stringify(msgval,null,2)) but the msgVal's content is actually the unboxed object, while it should be the boxed string, for the calculation of the key to be correct.

arj03 commented 3 years ago

Damn. Thought it was something like that. The good thing is that once these things are deployed replication should start working again.

staltz commented 3 years ago

The fix is so obvious, it's going to hurt to review. :D

The good thing is that once these things are deployed replication should start working again.

Yes, no need to fiddle with ebt file recovery.

staltz commented 3 years ago

Pretty sure this can be closed. Thanks @arj03 !