Closed mixmix closed 4 years ago
If this bug got resolved, this changes the data grabbing part of initialSync from 10 minutes down to 2.5 minutes!
The indexing is just slow at the moment, but that's probably all going to change with some of that new rust work
my hunch is that it's queuing up too much, and blocking the cpu and then timing out?
beautifully fast replication (say 4+MB/s)
are you measuring this at the send side or the receive side?
@mixmix
Thanks for the awesome bug report. I got a bit mixed up between the two peers while reading the "what we see" bit, can you verify that this is correct?
`sbot status` polls show that Alice continues to try to reconnect.

Names [obviously] don't matter, but I want to make sure I understand the setup correctly. Also, are they both configured the same (e.g. does Bob also try to connect to Alice) or do they have designated roles as uploader and downloader?
Last note: is there any test code or steps that you could recommend to reproduce? I've only got one machine but I'd really like to get this fixed ASAP.
I wonder if https://github.com/ssbc/ssb-ebt/pull/15 could help with this problem.
@pietgeursen What do you think?
Matt and I spent a lot of time debugging this a few months ago in patchwork. We had the code from #15 running and it didn't seem to change things.
What we saw:
With ebt:
I would still like to know what was meant by "4 MB". How was it being measured? Was it detected at the receive side or the send side?
https://github.com/ssbc/ssb-ebt/issues/22#issuecomment-473788800
hey everyone I'd quite like to fix this, but I don't have enough information to know how to reproduce this. do I need two computers? @mixmix
In my experience you need two computers AND an internet connection between them (LAN doesn’t seem to trigger it).
@pietgeursen and I wrote up detailed reproduction steps a few months back, but I'm not sure what the link is. Maybe Piet has it?
Something like:
We found that this could be reproduced by connecting to any large pub. We were also able to reproduce it with a tunneled connection over the internet between our two sbots.
One way to make this work with only one computer might be to tunnel a port via a remote machine using ssh and then run two sbots on the same machine communicating over this port.
The key takeaway is that the transport seems to be important. On my internet, this happens almost every time I try to do a large sync with a remote peer.
@mmckegg thanks! yeah it smells like it's actually related to something closer to networking, maybe it's time for muxrpc to finally get back pressure...
debug idea: ssb-replicate works, but is slow, ebt goes fast, but then stalls. what happens if you slow ebt down to ssb-replicate speed?
I could reproduce this by trying to do initial sync. with Patchbay you can easily manage creating new identities, and if there's a peer on the same network it should be hella fast to follow them and get all the stuff. AND IT IS... except it crashes out and doesn't pick it up sometimes.
You might be able to do it with 2 peers on the same machine ... but it's easy if there's a friend in the room to get this happening. It's easy to spot when it happens with the onboarding I've partly built for ssb-ahoy, but you can see it with `watch ssb-server status` too: just watch for the database to stop growing when you know it should keep going.
I saw it stopping at seemingly random points:
update: I tried replicating with myself, two sbots running on one computer. It went really slow, ~200 messages per second (which I measured by adding a setInterval and console.log into muxrpc/pull-weird). Then I tried replicating with my pub, and I saw 2000+ messages per second. I also saw it disconnect and reconnect, but I didn't see it stall yet.
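The counting instrumentation described above could look roughly like this (a hypothetical sketch; `makeRateCounter` is my own name, not anything from muxrpc or pull-weird):

```javascript
// Hypothetical sketch of a per-second message counter dropped into a
// stream handler. Call onMessage() once per message seen; read() returns
// the count since the last read and resets it.
function makeRateCounter () {
  let count = 0
  return {
    onMessage: () => { count++ },
    read: () => {
      const n = count
      count = 0
      return n
    }
  }
}

// e.g. log the rate once per second:
// const rate = makeRateCounter()
// setInterval(() => console.log(rate.read() + ' msgs/sec'), 1000)
```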
I still do not understand how you were measuring "4 MB per second". Are you just looking at how fast the log was growing?
currently I'm testing this from a script that loads ssb plugins directly... could the frontends be causing this?
I am measuring the message receive rate with this command:
tail -f $TEMP_SSB_DIR/flume/log.offset | pv > /dev/null
`pv` is "pipe view", a very handy command.
I'm only seeing 1-2 MB a second that way; could be my slow computer though.
with my script and minimal indexes, I replicated the entire log, 563 MB, from following myself with hops:2 in 6 minutes.
I noticed my connections were going down and coming back up; eventually I realized that my server was crashing because of ssb-viewer (something in the markdown parser)... now that the connection lives longer, maybe it will stall.
hmm, it still seems to be disconnecting and reconnecting...
I set `timers.inactivity = -1`
and I saw it stall. it was still connected, but the EBT stream ended... trying to figure out why that would happen...
my first hunch is it ends after an invalid message, vaguely remember something about that...
:boom: I figured it out!
there was a mistake here: https://github.com/dominictarr/epidemic-broadcast-trees/blob/master/events.js#L448
If you are connected to a peer, and anyone blocks them, you mark them as blocked, which causes the replication stream to end: https://github.com/dominictarr/epidemic-broadcast-trees/blob/master/stream.js#L96-L97 but this does not disconnect the stream. If you have `config.timers.inactivity` set to something small, it will disconnect because the stream goes quiet, but the default behaviour (https://github.com/ssbc/secret-stack/blob/master/core.js#L93), if you are running ssb-config, is that `config.timers` gets set such that inactivity is 600e3, ten minutes.
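In config terms, that timer decides how long a quiet connection survives before being dropped. A hypothetical config fragment (the value here is my illustration, not from the thread):

```javascript
// Hypothetical config fragment: lowering timers.inactivity makes a
// stalled EBT stream surface as a disconnect almost immediately,
// instead of lingering for the 600e3 ms (10 minute) default.
const config = {
  timers: {
    inactivity: 5e3 // 5 seconds instead of ten minutes
  }
}
```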
I wasn't seeing it at first because I wasn't using the config module in my script, so inactivity was 5 seconds, which I didn't notice.
The fix is simple, the line in events.js should be:
`if(state.id == ev.id && state.peers[ev.target]) {`
i.e. only end the stream if we block them.
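The corrected condition can be read as a small predicate, something like this (a hedged paraphrase of the events.js logic; `shouldEndStream` is my own name, not from the codebase):

```javascript
// End the replication stream only if *we* are the one doing the
// blocking (state.id === ev.id) AND we are currently replicating
// with the blocked peer (state.peers[ev.target] exists).
function shouldEndStream (state, ev) {
  return state.id === ev.id && Boolean(state.peers[ev.target])
}
```

Before the fix, a block by any peer matched, so a third party blocking someone you were connected to would silently end your replication stream.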
this should be fixed in ssb-ebt@5.6.5, please confirm @mixmix @christianbundy @mmckegg
this is great news, looking forward to testing this @dominictarr !
There might have been 2 ways I was roughly testing speed in the past:

- Watching the log size grow via `sbot.status` every 2 seconds from the client
- `$ watch -d ls -lah ~/.ssb/flume/log.offset` (similar, but doesn't bug the sbot)

The "speed" was something I did in my head, looking at how many MB it was jumping (roughly) on each poll. There was a 4x difference between classic replication and EBT.

Notes on those stats:
setup : 2 Patchbay-based peers on the same network. One long-standing, one "new" (no log), with connections locked down to be local-only:
what we see : beautifully fast replication (say 4+MB/s), and then at some point (say after 100 or 400 MB) something about the connection with the other local peer goes wrong, and the log stops getting data.
`sbot status` polls show that the peer trying to pull fresh data continues to try to reconnect (as the scheduler is doing its job, and this is the only legal connection), but it never connects again. If I close the "new" peer and start it again it picks up where it left off fine, but might get stuck again.

guess : It seems like it's getting itself into some polluted state, but I don't know what the problem is.
meanwhile `ssb-replicate` just works, but it's dramatically slower (1MB/s) :cry:

@mmckegg and @pietgeursen have reported the same things in their recent work.