Closed mixmix closed 4 years ago
If this bug got resolved, this changes the data grabbing part of initialSync from 10 minutes down to 2.5 minutes!
The indexing is just slow at the moment, but that's probably all going to change with some of that new rust work
my hunch is that it's queuing up too much, and blocking the cpu and then timing out?
beautifully fast replication (say 4+MB/s)
are you measuring this at the send side or the receive side?
@mixmix
Thanks for the awesome bug report. I got a bit mixed up between the two peers while reading the "what we see" bit, can you verify that this is correct?
`sbot status` polls show that Alice continues to try to reconnect.

Names [obviously] don't matter, but I want to make sure I understand the setup correctly. Also, are they both configured the same (e.g. does Bob also try to connect to Alice) or do they have designated roles as uploader and downloader?
Last note: is there any test code or steps that you could recommend to reproduce? I've only got one machine but I'd really like to get this fixed ASAP.
I wonder if https://github.com/ssbc/ssb-ebt/pull/15 could help with this problem.
@pietgeursen What do you think?
Matt and I spent a lot of time debugging this a few months ago in patchwork. We had the code from #15 running and it didn't seem to change things.
What we saw:
With ebt:
I would still like to know what was meant by "4 MB". How was it being measured? Was it detected at the receive side or the send side?
https://github.com/ssbc/ssb-ebt/issues/22#issuecomment-473788800
hey everyone I'd quite like to fix this, but I don't have enough information to know how to reproduce this. do I need two computers? @mixmix
In my experience you need two computers AND an internet connection between them (LAN doesn’t seem to trigger it).
@pietgeursen and I wrote up detailed reproduction steps a few months back, but I'm not sure what the link is. Maybe Piet has it?
Something like:
We found that this could be reproduced by connecting to any large pub. We were also able to reproduce it with a tunneled connection over the internet between our two sbots.
One way to make this work with only one computer might be to tunnel a port via a remote machine using ssh and then run two sbots on the same machine communicating over this port.
The key takeaway is that the transport seems to be important. On my internet, this happens almost every time I try to do a large sync with a remote peer.
@mmckegg thanks! yeah it smells like it's actually related to something closer to networking, maybe it's time for muxrpc to finally get back pressure...
debug idea: ssb-replicate works, but is slow, ebt goes fast, but then stalls. what happens if you slow ebt down to ssb-replicate speed?
I could reproduce this by trying to do initial sync. with Patchbay you can easily manage creating new identities, and if there's a peer on the same network it should be hella fast to follow them and get all the stuff. AND IT IS... except it crashes out and doesn't pick it up sometimes.
You might be able to do it with 2 peers on the same machine ... but it's easy if there's a friend in the room to get this happening. It's easy to spot when it happens with the onboarding I've partly built for ssb-ahoy, but you can see it with `watch ssb-server status` too: just watch for the database to stop growing when you know it should keep going.
I saw it stopping at seemingly random points:
update: I tried replicating with myself, two sbots running on one computer. It went really slow, ~200 messages per second (which I measured by adding a setInterval and console.log into muxrpc/pull-weird). Then I tried replicating with my pub, and I saw 2000+ messages per second. I also saw it disconnect and reconnect, but I didn't see it stall yet.
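The counting instrumentation described above could look roughly like this (a hypothetical sketch; `makeRateCounter` is my own name, not anything from muxrpc or pull-weird):

```javascript
// Hypothetical sketch of a per-second message counter dropped into a
// stream handler. Call onMessage() once per message seen; read() returns
// the count since the last read and resets it.
function makeRateCounter () {
  let count = 0
  return {
    onMessage: () => { count++ },
    read: () => {
      const n = count
      count = 0
      return n
    }
  }
}

// e.g. log the rate once per second:
// const rate = makeRateCounter()
// setInterval(() => console.log(rate.read() + ' msgs/sec'), 1000)
```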
I still do not understand how you were measuring "4 MB per second". Are you just looking at how fast the log was growing?
currently I'm testing this from a script that loads ssb plugins directly... could the frontends be causing this?
I am measuring the message receive rate with this command:
tail -f $TEMP_SSB_DIR/flume/log.offset | pv > /dev/null
`pv` is "pipe view", a very handy command.
I'm only seeing 1-2 MB a second that way; could be my slow computer though.
with my script and minimal indexes, I replicated the entire log, 563 MB, from following myself with hops:2 in 6 minutes.
I noticed my connections were going down and coming back up; eventually I realized that my server was crashing because of ssb-viewer (something in the markdown parser)... now that the connection lives longer, maybe it will stall.
hmm, it still seems to be disconnecting and reconnecting...
I set `timers.inactivity = -1`
and I saw it stall. it was still connected, but the EBT stream ended... trying to figure out why that would happen...
my first hunch is it ends after an invalid message, vaguely remember something about that...
:boom: I figured it out!
there was a mistake here: https://github.com/dominictarr/epidemic-broadcast-trees/blob/master/events.js#L448
If you are connected to a peer, and anyone blocks them, you mark them as blocked, which causes the replication stream to end: https://github.com/dominictarr/epidemic-broadcast-trees/blob/master/stream.js#L96-L97 but this does not disconnect the stream. If you have `config.timers.inactivity` set to something small, it will disconnect because the stream goes quiet, but the default behaviour (https://github.com/ssbc/secret-stack/blob/master/core.js#L93), if you are running ssb-config, is that `config.timers` gets set such that inactivity is 600e3, ten minutes.
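In config terms, that timer decides how long a quiet connection survives before being dropped. A hypothetical config fragment (the value here is my illustration, not from the thread):

```javascript
// Hypothetical config fragment: lowering timers.inactivity makes a
// stalled EBT stream surface as a disconnect almost immediately,
// instead of lingering for the 600e3 ms (10 minute) default.
const config = {
  timers: {
    inactivity: 5e3 // 5 seconds instead of ten minutes
  }
}
```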
I wasn't seeing it at first because I wasn't using the config module in my script, so inactivity was 5 seconds, which I didn't notice.
The fix is simple, the line in events.js should be:
`if(state.id == ev.id && state.peers[ev.target]) {`
i.e. only end the stream if we block them.
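The corrected condition can be read as a small predicate, something like this (a hedged paraphrase of the events.js logic; `shouldEndStream` is my own name, not from the codebase):

```javascript
// End the replication stream only if *we* are the one doing the
// blocking (state.id === ev.id) AND we are currently replicating
// with the blocked peer (state.peers[ev.target] exists).
function shouldEndStream (state, ev) {
  return state.id === ev.id && Boolean(state.peers[ev.target])
}
```

Before the fix, a block by any peer matched, so a third party blocking someone you were connected to would silently end your replication stream.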
this should be fixed in ssb-ebt@5.6.5, please confirm @mixmix @christianbundy @mmckegg
this is great news, looking forward to testing this @dominictarr !
There might have been 2 ways I was roughly testing speed in the past:

- Watching the log size grow via `sbot.status` every 2 seconds from the client
- `$ watch -d ls -lah ~/.ssb/flume/log.offset` (similar, but doesn't bug the sbot)

The "speed" was something I did in my head, looking at how many MB it was jumping (roughly) on each poll. There was a 4x difference between classic replication and EBT.

Notes on those stats:
setup : 2 Patchbay-based peers on the same network. One long-standing, one "new" (no log), with connections locked down to be local-only:
what we see : beautifully fast replication (say 4+MB/s), and then at some point (say after 100 or 400 MB) something about the connection with the other local peer goes wrong, and the log stops getting data.
`sbot status` polls show that the peer trying to pull fresh data continues to try to reconnect (as the scheduler is doing its job, and this is the only legal connection), but it never connects again. If I close the "new" peer and start it again it picks up where it left off fine, but might get stuck again.

guess : It seems like it's getting itself into some polluted state, but I don't know what the problem is.
meanwhile `ssb-replicate` just works, but it's dramatically slower (1MB/s) :cry:

@mmckegg and @pietgeursen have reported the same things in their recent work.