matrix-org / matrix-appservice-irc

Node.js IRC bridge for Matrix
Apache License 2.0

The Libera bridge is not monitored #1634

Open progval opened 1 year ago

progval commented 1 year ago

In the last three months, the Libera bridge has experienced three outages affecting all channels (#1601, #1628, #1633).

There should probably be some automated monitoring to detect this kind of outage, rather than relying on users noticing and reporting them.

For example, a bot with its own private channel / portal room could send a message on both Matrix and IRC and check whether it was received on the other end within 5 seconds (or whatever delay is considered acceptable). It would then report failures to the appropriate channels, such as https://status.matrix.org/ (which currently shows 100.0% uptime for the Libera bridge over the last three months).
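To make the proposal concrete, here is a rough sketch of such a round-trip probe in Python. It is not how the bridge or its existing monitoring works. The names `probe_round_trip`, `check_both_directions`, and `FakeBridge` are hypothetical, and the `send` / `poll_received` callables stand in for whatever Matrix and IRC client code a real bot would use. The 5-second default is taken from the suggestion above.

```python
import time
import uuid
from typing import Callable, Dict, List, Optional


def probe_round_trip(
    send: Callable[[str], None],
    poll_received: Callable[[], List[str]],
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.1,
) -> Optional[float]:
    """Send a unique token on one side and wait for it on the other side.

    `send` and `poll_received` are hypothetical hooks: a real bot would back
    them with a Matrix client and an IRC client in a private portal room.
    Returns the observed latency in seconds, or None if the token was not
    seen within `timeout_s`.
    """
    token = f"bridge-probe-{uuid.uuid4().hex}"
    started = time.monotonic()
    send(token)
    while True:
        if any(token in message for message in poll_received()):
            return time.monotonic() - started
        if time.monotonic() - started >= timeout_s:
            return None
        time.sleep(poll_interval_s)


def check_both_directions(
    matrix_send: Callable[[str], None],
    irc_poll: Callable[[], List[str]],
    irc_send: Callable[[str], None],
    matrix_poll: Callable[[], List[str]],
    timeout_s: float = 5.0,
) -> Dict[str, Optional[float]]:
    """Probe Matrix -> IRC and IRC -> Matrix; None marks a failed direction."""
    return {
        "matrix_to_irc": probe_round_trip(matrix_send, irc_poll, timeout_s),
        "irc_to_matrix": probe_round_trip(irc_send, matrix_poll, timeout_s),
    }


class FakeBridge:
    """In-memory stand-in for one bridge direction, for demonstration only."""

    def __init__(self, healthy: bool) -> None:
        self.healthy = healthy
        self.delivered: List[str] = []

    def send(self, text: str) -> None:
        # A broken bridge silently drops the message.
        if self.healthy:
            self.delivered.append(text)

    def poll(self) -> List[str]:
        return list(self.delivered)


if __name__ == "__main__":
    up, down = FakeBridge(healthy=True), FakeBridge(healthy=False)
    result = check_both_directions(up.send, up.poll, down.send, down.poll,
                                   timeout_s=0.5)
    for direction, latency in result.items():
        print(direction, "OK" if latency is not None else "TIMED OUT")
```

A unique token per probe keeps a late message from an earlier probe from being counted as a success. The result for each direction could then feed whatever alerting or status page is in use. That reporting step is left out here because the thread does not describe how status.matrix.org is updated beyond it being manual.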

Half-Shot commented 1 year ago

We have automated monitoring, but the types of failures we have seen do not fit the models we expected. The status.matrix.org page is manually updated at the moment, and I think we could do better at updating it as fires happen.

Half-Shot commented 1 year ago

I believe this situation has now improved; we've got monitoring on the bridge for:

This doesn't cover everything, and our next objectives are to track the number of dropped messages (and ideally, add some sort of E2E monitoring to see where messages are going missing).

progval commented 1 year ago

There are currently a number of outages reported on https://github.com/matrix-org/libera-chat/issues, including clients stuck reconnecting (https://github.com/matrix-org/libera-chat/issues/12, https://github.com/matrix-org/libera-chat/issues/24), but status.matrix.org shows 100% green.

toabi commented 1 year ago

Lots of puppets can't currently join the IRC side… but status says "green".