matrix-org / matrix-appservice-irc

Node.js IRC bridge for Matrix
Apache License 2.0

Planned server restart on Libera causes the bridge to be silently unavailable for 2.5 hours #1601

Closed by progval 1 year ago

progval commented 2 years ago

Libera had a planned netsplit today at 19:55 UTC, and the bridge still has not recovered at the time I am writing this. No messages are going from IRC to Matrix, and most Matrix users' messages are not sent to IRC, because their puppets did not reconnect.

Additionally, the org.matrix.appservice-irc.connection event is not set on the Matrix side until someone speaks on the IRC side.

progval commented 2 years ago

It has been working again since ~22:35 UTC. It looks like the interruption was caused by the privacy filter kicking in while the bridge took 2.5h to reconnect all the puppets.
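To illustrate the explanation above, here is a minimal sketch of how a privacy-style filter combined with staggered reconnections could keep IRC->Matrix relaying blocked for a long time. Everything in it is an assumption for illustration: `Room`, `canRelayToMatrix`, and `estimateReconnectMs` are hypothetical names and not the bridge's actual code, and the model (relay only when every Matrix member has a connected puppet) is a simplification.

```typescript
// Hypothetical, simplified model; not taken from matrix-appservice-irc.
interface Room {
  // Matrix user IDs joined to the bridged room.
  matrixMembers: string[];
}

// Returns true only when every Matrix member has a connected IRC puppet,
// i.e. when a privacy-style filter would let IRC -> Matrix messages through.
function canRelayToMatrix(room: Room, connectedPuppets: Set<string>): boolean {
  return room.matrixMembers.every((userId) => connectedPuppets.has(userId));
}

// Estimates the time until the last puppet is back, assuming reconnections
// are staggered: `concurrency` puppets are started every `delayMs`.
function estimateReconnectMs(
  puppetCount: number,
  delayMs: number,
  concurrency: number,
): number {
  if (puppetCount <= 0) {
    return 0;
  }
  const batches = Math.ceil(puppetCount / Math.max(1, concurrency));
  return batches * delayMs;
}
```

With made-up figures of 9000 puppets reconnected one per second, `estimateReconnectMs(9000, 1000, 1)` gives 9,000,000 ms, which is 2.5 hours. These numbers are chosen only to show the scale and are not measurements from the bridge. In this model a single missing puppet is enough to keep `canRelayToMatrix` false for the whole room.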

Half-Shot commented 2 years ago

Needs investigation from the team to see why this took so long.

tadzik commented 2 years ago

I see a wave of these in the logs:

Aug 30 20:30:04 liberairc[12912]: 2022-08-30 20:30:04 ERROR:client-connection Server: irc.libera.chat (XXX) Network Error: {
Aug 30 20:30:04 liberairc[12912]:   "errno": -110,
Aug 30 20:30:04 liberairc[12912]:   "code": "ETIMEDOUT",
Aug 30 20:30:04 liberairc[12912]:   "syscall": "read"
Aug 30 20:30:04 liberairc[12912]: }
Aug 30 20:30:04 liberairc[12912]: 2022-08-30 20:30:04 INFO:client-connection disconnect()ing XXX@irc.libera.chat - net_error
Aug 30 20:30:04 liberairc[12912]: 2022-08-30 20:30:04 WARN:ClientPool Client bzzzzzzzzzzz (@XXX:XXX.com) disconnected with reason net_error

But there is nothing after 20:30. It seems the bridge took over half an hour to notice its old connections timing out, at which point it reconnected the puppets, and the connections themselves were established as quickly as they usually are.

I see nothing hinting at a 2.5 hour delay though.
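For context on why a dead connection can go unnoticed for so long: an `ETIMEDOUT` on `read` is only reported once the operating system gives up on the socket. A generic way to notice sooner is an application-level liveness check. The sketch below is an assumption for illustration; `IdleWatchdog` and its parameters are hypothetical names, and this thread does not establish whether or how the bridge does something similar.

```typescript
// Hypothetical liveness check; names and thresholds are illustrative only.
type Verdict = "alive" | "send-ping" | "dead";

class IdleWatchdog {
  private readonly idleMs: number;
  private readonly pingTimeoutMs: number;
  private lastSeenMs: number;
  private pingSentAtMs: number | null = null;

  constructor(idleMs: number, pingTimeoutMs: number, nowMs: number) {
    this.idleMs = idleMs;
    this.pingTimeoutMs = pingTimeoutMs;
    this.lastSeenMs = nowMs;
  }

  // Call whenever any data arrives from the server.
  onData(nowMs: number): void {
    this.lastSeenMs = nowMs;
    this.pingSentAtMs = null;
  }

  // Call periodically. Asks for a ping after `idleMs` of silence and
  // declares the connection dead if nothing arrives within `pingTimeoutMs`.
  check(nowMs: number): Verdict {
    if (this.pingSentAtMs !== null) {
      return nowMs - this.pingSentAtMs >= this.pingTimeoutMs ? "dead" : "alive";
    }
    if (nowMs - this.lastSeenMs >= this.idleMs) {
      this.pingSentAtMs = nowMs;
      return "send-ping";
    }
    return "alive";
  }
}
```

The clock is passed in explicitly so the logic can be exercised without real sockets or timers. With `idleMs` of 60 seconds and `pingTimeoutMs` of 30 seconds, a silent connection is declared dead after 90 seconds rather than whenever the kernel's own timeout fires.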

progval commented 2 years ago

This is happening again today (2022-09-07). Timeline (in UTC) in #matrix-irc as far as I can tell:

Still no IRC->Matrix relaying on #matrix-irc last I checked (2022-09-12 08:18:33)

podiki commented 2 years ago

Seeing the same thing here. Messages I send from Matrix go through, but I do not get messages from IRC->Matrix. One channel (stumpwm) is working fine, while another (guix) has been this way for about 12 hours as of now. I've tried disconnecting and reconnecting from Matrix to the channel, with no change.

podiki commented 2 years ago

Another day later and I still only have a one-way connection in #guix (two other rooms seem fine though). Is there anything we can do?

simonmichael commented 2 years ago

I confirm an outage is still in progress, e.g. the #haskell IRC channel has not been relaying to Matrix since 2022-09-08 03:30 UTC.

exarkun commented 2 years ago

The last message I saw come through on Matrix from #python was at 2022-09-08 2116 (I don't know what TZ Element is reporting this timestamp in).

eeickmeyer commented 2 years ago

Confirming outage is still in progress. All #ubuntustudio rooms are one-way Matrix -> IRC.

CairnThePerson commented 2 years ago

Still having issues in #emacs and #guix.

podiki commented 2 years ago

Is there anyone to contact to restart these bridges? Given that it has been days now, I'm not holding my breath that they will fix themselves.

Edit: they know https://fosstodon.org/@liberachat/108976211933737326

progval commented 2 years ago

Element replied on Twitter: https://twitter.com/element_hq/status/1568710974210719745

progval commented 2 years ago

Element just restarted the bridge; it's slowly coming back up.

progval commented 2 years ago

It seems to be working fine in rooms that were working a week ago. You should probably open a new issue if you see a room that doesn't work.

progval commented 2 years ago

It's happening again. Netsplit today at 11:46, and all IRC->Matrix and some Matrix->IRC messages are still missing at 12:32.

EDIT: and it fixed itself at 13:21:14 (possibly a little later for other channels), so about 2.5 hours later

EDIT2: #matrix-irc was missing a puppet until I spoke in the channel, much later than that:

15:36:26 <val> jA_cOp: it's not restarted several times a week, it just crashes when it sees too much traffic at once
15:36:26 --> psydroid [psydroid] (@psydroid:matrix.org) (~psydroid@user/psydroid) has joined #matrix-irc
15:37:54 <val> test

Only the second message went through.

Half-Shot commented 1 year ago

While we're still experiencing shaky bridge behavior, the project has moved on somewhat and this is better tracked in the later issues.