ssbc / ssb-ebt

secure scuttlebutt replication with epidemic-broadcast-trees
MIT License
18 stars 10 forks source link

Why replicate only with connections that the peer initiated? #33

Closed staltz closed 3 years ago

staltz commented 4 years ago

I might have found the cause of issue https://github.com/staltz/ssb-conn/issues/11, cc @christianbundy

When ssb-ebt detects a new connection, it first checks whether isClient is true (meaning, this RPC connection is one that this peer initiated), and it only applies the EBT replication algorithm if it was true.

https://github.com/ssbc/ssb-ebt/blob/6bd6679ed71e191c11fa86a3c04b1850c5cf4dd6/index.js#L122-L123

The problem is, that's not what we want in some cases. For instance, if you connect to a room and there is a friend of yours there online, they will initiate connection with you as soon as possible, and then we will receive an 'rpc:connect' with isClient=false because we didn't initiate the connection, and EBT will ignore that code snippet.

Maybe this is not a problem that often, but I noticed a problem when doing a backup recover, described below:

Thing is, if I turn off ssb-ebt for Jane, and leave only ssb-replicate on, then during the Important Causal Moment, Jane will start syncing data with Sarah, and will get all her feed (so Important Consequence Moment turns out fine).

I also tried keeping both ssb-ebt and ssb-replicate on for Jane, with a small change to ssb-ebt: I removed the if (isClient) { check, keeping only the body of the if block. And this also caused Jane to correctly recover all her data from Sarah.

So my question is: why do we need if (isClient)? And could we solve this issue by either removing that if condition or by fixing code elsewhere?


Update:

Now I'm trying to figure out why the tests don't run (hence fail) when I comment out if (isClient).

But I ran Patchwork while EBT's if (isClient) was commented out, and everything seemed to work fine.

christianbundy commented 4 years ago

Thanks for this investigation! FYI, when I comment out the condition then the tests still run (and pass). Full diff, just in case it's useful:

diff --git a/index.js b/index.js
index ef91a5e..5fbc76e 100644
--- a/index.js
+++ b/index.js
@@ -120,19 +120,17 @@ exports.init = function (sbot, config) {

   sbot.on('rpc:connect', function (rpc, isClient) {
-    if(isClient) {
-      var opts = {version: 3}
-      var a = toPull.duplex(ebt.createStream(rpc.id, opts.version, true))
-      var b = rpc.ebt.replicate(opts, function (err) {
-        if(err) {
-          rpc.removeListener('closed', onClose)
-          rpc._emit('fallback:replicate', err)
-        }
-      })
+    var opts = {version: 3}
+    var a = toPull.duplex(ebt.createStream(rpc.id, opts.version, true))
+    var b = rpc.ebt.replicate(opts, function (err) {
+      if(err) {
+        rpc.removeListener('closed', onClose)
+        rpc._emit('fallback:replicate', err)
+      }
+    })

-      pull(a, b, a)
-      rpc.on('closed', onClose)
-    }
+    pull(a, b, a)
+    rpc.on('closed', onClose)
   })

   function block (from, to, blocking) {

$ npm version
{
  'ssb-ebt': '5.6.7',
  npm: '6.13.4',
  ares: '1.15.0',
  brotli: '1.0.7',
  cldr: '36.0',
  http_parser: '2.8.0',
  icu: '65.1',
  llhttp: '1.1.4',
  modules: '72',
  napi: '5',
  nghttp2: '1.39.2',
  node: '12.14.0',
  openssl: '1.1.1d',
  tz: '2019c',
  unicode: '12.1',
  uv: '1.34.0',
  v8: '7.7.299.13-node.16',
  zlib: '1.2.11'
}
staltz commented 4 years ago

Interesting, here's my npm version:

{
  'ssb-ebt': '5.6.7',
  npm: '6.4.1',
  ares: '1.15.0',
  cldr: '33.1',
  http_parser: '2.8.0',
  icu: '62.1',
  modules: '64',
  napi: '3',
  nghttp2: '1.34.0',
  node: '10.15.3',
  openssl: '1.1.0j',
  tz: '2018e',
  unicode: '11.0',
  uv: '1.23.2',
  v8: '6.8.275.32-node.51',
  zlib: '1.2.11'
}
staltz commented 4 years ago

I think Travis has the same problems as my machine has: https://travis-ci.org/ssbc/ssb-ebt/jobs/632002154?utm_medium=notification&utm_source=github_status

dominictarr commented 4 years ago

There are two reasons for this behaviour. The first is that sometimes peers connect, without intending to replicate. for example, to redeem an invite code, or it's just the cli or other front end that's connecting. If such a connection is a new one, then EBT will send a large vector clock on the handshake, but there is no point doing this unless it's a real peer. The peer connecting knows which it is, so just let them create the replication stream. Secondly, the replication stream is duplex. it's created once, then data flows both ways. If you remove the isClient check both ends will create a stream! So when these peers realize they are replicating with the same peer twice, I think they will close the old stream(?) that means both peers may stop replicating?

Alternative approach: why is Jane not emitting rpc:connect? If she initiated the connection, that event should be triggered, and isClient should pass, so sarah should get the data from jane when jane starts the replication.

dominictarr commented 4 years ago

idea: this may be caused by an edge case in request skipping. assuming jane does call sarah.ebt.replicate() but, because she was previously up to date, sends an empty clock {} sarah has amnesia so doesn't remember who jane is, just sees a stranger that doesn't want to replicate any feeds (but has called ebt.replicate anyway, with an empty clock. This would mean nothing happens... but if jane always included her feed the clock would be {<sarah.id>: 0} to which jane would respond with {<sarah.id>: <some_number>} and replication would start. Legacy replication doesn't have request skipping (which makes it use a lot more bandwidth, and that is why it still works)

This would only happen if sarah and jane where already up to date.

Hmm, if there are multiple feeds, maybe need to disable request skipping on all of those, at least until after recovery.

staltz commented 4 years ago

It sounds to me that it's more likely that this is an edge case of request skipping, but to clarify how this bug was reproduced:

Alternative approach: why is Jane not emitting rpc:connect? If she initiated the connection, that event should be triggered, and isClient should pass, so sarah should get the data from jane when jane starts the replication.

Jane did not initiate the connection, Sarah did. Essentially, Jane has an empty database (because she just recovered her secret and nothing else) and has connected to a room, and Sarah is also online in that room, while Sarah has a full database (with feeds from both). It's Sarah who initiates the connection, but Sarah has isClient=true.

I wish I could understand the EBT codebase better to provide a better diagnosis of what's going on and how to fix it, but I don't yet understand. There's also too many shared concerns between ssb-friends, ssb-replicate, ssb-ebt, and to some degree ssb-conn. Like, ssb-ebt doesn't do just EBT, it handles blocks. And ssb-friends doesn't just calculate friends, it also triggers replication. And the whole ssb-ebt is written to hook into and interfere with ssb-replicate. It's a difficult part of the stack.

For now I'm deploying ssb-ebt with a hack to comment out that if (isClient), and this actually fixes the bugs I've seen, although it's probably being more inefficient.

dominictarr commented 4 years ago

@staltz it's necessary for replicate to know about blocks. I wouldn't have done that, but to have peers not give out messages by A to peer B blocked by A it was necessary. having friends trigger replication means that ebt just does replication. It is hard because replication is hard.

If you are gonna just uncomment things and ship it, because it fixes bugs you are observing... that's likely gonna lead to more hard to diagnose bugs. Replication is the hard, most core part, that's why things like this needs test cases, and correct fixes. Some things are just that hard and you need to embrace it.

I was confused by "sarah" and "jane'. When I use "alice" and "bob", "alice" is always the initiator because A comes first. so sarah (alice) is the client and connects to jane (bob) who has amnesia. sarah_alice calls jane_bob.replicate.ebt() but replication doesn't start...

dominictarr commented 4 years ago

@staltz it's necessary for replicate to know about blocks. I wouldn't have done that, but to have peers not give out messages by A to peer B blocked by A it was necessary. having friends trigger replication means that ebt just does replication. It is hard because replication is hard.

If you are gonna just uncomment things and ship it, because it fixes bugs you are observing... that's likely gonna lead to more hard to diagnose bugs. Replication is the hard, most core part, that's why things like this needs test cases, and correct fixes. Some things are just that hard and you need to embrace it.

I was confused by "sarah" and "jane'. When I use "alice" and "bob", "alice" is always the initiator because A comes first. so sarah (alice) is the client and connects to jane (bob) who has amnesia. sarah_alice calls jane_bob.replicate.ebt() but replication doesn't start...

staltz commented 4 years ago

Have you discovered what request skipping corner case are we having with this issue? I need to fix critical end-user issues and I'd love to do it the right way but unless you help me understand how this bug works (because ultimately there's not many who understand these modules) then you shouldn't tell me to avoid the wrong way of fixing it for people who don't care much how it's fixed, as long as it's fixed asap. I frequently ship something that works, and then gradually over time make it "right".

dominictarr commented 4 years ago

@staltz if you ship the quick fix, but make a test case that reproduces it, we can figure out the right way and I'll be happy.

Do you think you can reproduce this reliably?

staltz commented 4 years ago

@dominictarr I will try to write the test, but it's reproducible reliably, see the original post that lists the actions taken to reproduce.

stale[bot] commented 4 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

KyleMaas commented 3 years ago

Pretty sure I hit this with https://github.com/arj03/ssb-browser-demo/issues/186

staltz commented 3 years ago

@arj03 Do you think this issue is solved by the recent changes in epidemic-broadcast-tree?

arj03 commented 3 years ago

I think it should. Please try EBT 8.0.1 and see if that solves the problem.

staltz commented 3 years ago

There's no version of ssb-ebt that uses epidemic-broadcast-trees 8.0.1, right? Currently ssb-ebt depends on e-b-t 7

arj03 commented 3 years ago

Correct.

staltz commented 3 years ago

I made ssb-ebt@6.0.0-rc1

staltz commented 3 years ago

Yeah, I can't reproduce this bug anymore with 6.0.0-rc1 (honestly, neither with the version in master branch, no idea why not, even though I followed the original reproduction steps), so closing.