sm00th / bitlbee-discord

Bitlbee plugin for Discord (http://discordapp.com)
GNU General Public License v2.0
290 stars 27 forks source link

High CPU usage and "closing websocket connection" #128

Closed roughnecks closed 6 years ago

roughnecks commented 6 years ago

Hi there.

Since a few days now, whenever I have bitlbee-discord running for a few hours, I get these errors and it's unable to recover; I have to quit weechat.

Also CPU usage is quite high when that happens - honestly it's high every time it connects and disconnects, even in "normal" conditions.

15:26 <@root> discord - Remote host is closing websocket connection 15:26 <@root> discord - Signing off.. 15:26 <@root> discord - Reconnecting in 5 seconds.. 15:27 <@root> discord - Logging in: Logged in 16:39 <@root> discord - Error: Failed to read data. 16:39 <@root> discord - Signing off.. 16:40 <@root> discord - Reconnecting in 5 seconds.. 16:41 <@root> discord - Logging in: Logged in 16:41 <@root> discord - Remote host is closing websocket connection 16:41 <@root> discord - Signing off.. 16:41 <@root> discord - Reconnecting in 5 seconds..

This is a breaking bug for me, since I cannot monitor my chat 24/24h and I can't have my CPU at 40% for "nothing". Just to address some questions before you ask, I'm running stable weechat on a Debian Stretch Virtualbox VM inside Windows 10 x64 fully upgraded. My CPU is Athlon II X3 @3.10Ghz and VM is able to use 2 cores out of 3, uncapped. Also worth saying that VM only has weechat running - nothing else.

Thanks

dequis commented 6 years ago

What plugin version? Type 'plugins' in &bitlbee

dgw commented 6 years ago

Is that CPU usage on the client end, not the BitlBee server process? I have noticed performance issues on my client too (The Lounge connected through ZNC).

roughnecks commented 6 years ago

What plugin version? Type 'plugins' in &bitlbee

16:59 <@roughnecks> plugins 16:59 <@root> Plugin Version 16:59 <@root> bitlbee-discord 0.4.1 16:59 <@root> steam 20170529~ a6444d2 ~141 16:59 <@root>

roughnecks commented 6 years ago

Is that CPU usage on the client end, not the BitlBee server process? I have noticed performance issues on my client too (The Lounge connected through ZNC).

Not sure, I had to quit every time and couldn't check.

sm00th commented 6 years ago

Does cpu spike on connect only or you also experience high cpu loads while it is connected? If that only happens on connect the most likely cause I can think of is #47. Are you in any highly-populated servers?

roughnecks commented 6 years ago

I removed from autojoin all the channels in a big server (like 9K users) before this started happening, so I am only in /r/Steam (5K max I think, if not less) and /r/SteamLink (less than 2K).

CPU spikes only upon connect and disconnect.

roughnecks commented 6 years ago

I mean, I don't really mind if CPU spikes upon connect, it only lasts 1 minute or so, but what about the websocket thing? It doesn't reconnect at all when that happens and every attempt is causing CPU spikes.

sm00th commented 6 years ago

what about the websocket thing?

This message is shown when discord servers are closing websocket connection without properly saying goodbye. There are 3 reasons I know why discord would do that:

It doesn't reconnect at all when that happens

So after first websocket close it just keeps happening until you restart bitlbee or is it enough to stop trying for some time? Anyway, looking at debuglog around the reconnects should be the next step, unfortunately it might be a bit noisy on big servers like that. My usual warning: if you are going to paste any parts of the debug log - make sure it doesn't contain your password, token or parts of the conversations you don't want others to see, if you are not sure you removed all of it - send it to my email (can be found in commit log) instead of posting it here.

roughnecks commented 6 years ago

Thanks for explanation. I couldn't reproduce the bug anymore. All's working fine so far, no more websocket disconnections.

sm00th commented 6 years ago

@roughnecks sent me logs showing this issue, and from what I can gather it is the heartbeats issue. The log contains 3 disconnects after which bd(bitlbee-discord) seems to have recovered on its own (please correct me if I am wrong here). All of the disconnects are preceded by bd sending heartbeat messages (op:1), but never receiving akcs for them. Websocket connection itself remains functional and continues to receive messages for a while before it gets closed. Discord documentation section on heartbeats states that If a client does not receive a heartbeat ack between its attempts at sending heartbeats, it should immediately terminate the connection with a non-1000 close code, reconnect, and attempt to resume. So it looks like some maintenance was taking place and bd didn't handle it correctly. They don't specify the time window we should receive ack in, leaving room for speculation and experiments. It looks like all of the disconnects happened in about 15 seconds after an unacknowledged heartbeat and during normal operation the acks are sent right away and arrive within a second, so maybe a good place to start would be initiating reconnect after 5-10 seconds without the ack.

Alcaro commented 6 years ago

They don't specify the time window we should receive ack in

They do. Ack timeout is equal to heartbeat interval. (Though I'll agree it's kinda hidden.)

If a client does not receive a heartbeat ack between its attempts at sending heartbeats, it should immediately terminate the connection with a non-1000 close code, reconnect, and attempt to resume.

sm00th commented 6 years ago

They do. Ack timeout is equal to heartbeat interval. (Though I'll agree it's kinda hidden.)

That was my first instinct as well, but the heartbeat interval in this case was 41250ms, and yet it closed websocket connection after just 15s.

sm00th commented 6 years ago

I've implemented discord's resume that enables bd to gracefully reconnect without pulling all of the server info again, I've also added heartbeat-timeout that would try to reconnect in this fassion. It is not merged to master yet but I'd be grateful if you could try it out and see if this helps in any way. I've ran it for a couple of days myself, but I only ran into the issue once and even then gateway wasn't responding on reconnect.

You can find the changes in resume branch

sm00th commented 6 years ago

Still haven't encountered this case in the wild, but at least it doesn't seem to break anything, so resume branch is merged.

dgw commented 6 years ago

Just updated to 6a251d1 myself, we'll see if I see a change in disconnect frequency.

sm00th commented 6 years ago

After running this for over a week I found that heartbeat timeout doesn't really work, I always get some kind of websocket issue before it times out. On the other hand, every websocket error when we are already connected seems to be able to be treated by soft-reconnect greatly reducing the pains of initial connection. I've pushed it to master, hopefully it won't break anything. If there are no issues with this I will be closing this issue in about a week.

Corosauce commented 6 years ago

The soft reconnect patch is definitely helping usability, no more multi minute 100% cpu and no bitlbee activity while it reconnect situations, was about a month behind on latest changes before updating yesterday.

From what I recall of the disconnecting before, it seems to be about the same, here is a log of the past 25 hours since rebuilding with latest yesterday, left in nickname stuff for transparency sake: https://gist.github.com/Corosauce/73e3ef6a6fbdf5277290b55b24dcb577

dgw commented 6 years ago

I too have found soft reconnect to be extremely helpful as far as usability. I can usually read the accumulated backlog in all of my Discord channels now just in client scrollback (500 lines) instead of logging in to view my logs. Big :+1: for the patch!

I've not seen nearly as many reconnects either, so overall this seems to be a great update. Thanks!

sm00th commented 6 years ago

Great, thanks for feedback. It seems to be working as expected so I am closing this issue.

P.S. @Corosauce set utf8_nicks true should help fix the nick noise.

Corosauce commented 6 years ago

Ahhh thanks! Hadn't gotten around to figuring that out.