It was asked to join #ubuntu on freenode, and the resulting join is now straining all available resources.
It stopped processing sync updates because of the large room join. Eventually the homeserver failed under the load, and the failures propagated to voyager:
```
/sync error Unknown error code: Unknown message
{ [Unknown error code: Unknown message]
errcode: undefined,
name: 'Unknown error code',
message: 'Unknown message',
data: '<html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body bgcolor="white">\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n<hr><center>nginx/1.10.0 (Ubuntu)</center>\r\n</body>\r\n</html>\r\n',
httpStatus: 504 }
Starting keep-alive
info VoyagerBot Sync state: SYNCING -> RECONNECTING
info VoyagerBot Processing 0 pending node updates. 0 remaining
info VoyagerBot Processed 0 node updates. 0 remaining
info VoyagerBot Processing 0 pending node updates. 0 remaining
info VoyagerBot Processed 0 node updates. 0 remaining
info VoyagerBot Sync state: RECONNECTING -> ERROR
ERR! VoyagerBot { error:
ERR! VoyagerBot { [ORG.MATRIX.JSSDK_TIMEOUT: Locally timed out waiting for a response]
ERR! VoyagerBot errcode: 'ORG.MATRIX.JSSDK_TIMEOUT',
ERR! VoyagerBot name: 'ORG.MATRIX.JSSDK_TIMEOUT',
ERR! VoyagerBot message: 'Locally timed out waiting for a response',
ERR! VoyagerBot data:
ERR! VoyagerBot { error: 'Locally timed out waiting for a response',
ERR! VoyagerBot errcode: 'ORG.MATRIX.JSSDK_TIMEOUT',
ERR! VoyagerBot timeout: 15000 } } }
info VoyagerBot Processing 0 pending node updates. 0 remaining
info VoyagerBot Processed 0 node updates. 0 remaining
info VoyagerBot Sync state: ERROR -> ERROR
ERR! VoyagerBot { error:
ERR! VoyagerBot { [ORG.MATRIX.JSSDK_TIMEOUT: Locally timed out waiting for a response]
ERR! VoyagerBot errcode: 'ORG.MATRIX.JSSDK_TIMEOUT',
ERR! VoyagerBot name: 'ORG.MATRIX.JSSDK_TIMEOUT',
ERR! VoyagerBot message: 'Locally timed out waiting for a response',
ERR! VoyagerBot data:
ERR! VoyagerBot { error: 'Locally timed out waiting for a response',
ERR! VoyagerBot errcode: 'ORG.MATRIX.JSSDK_TIMEOUT',
ERR! VoyagerBot timeout: 15000 } } }
info VoyagerBot Processing 0 pending node updates. 0 remaining
info VoyagerBot Processed 0 node updates. 0 remaining
info VoyagerBot Processing 0 pending node updates. 0 remaining
info VoyagerBot Processed 0 node updates. 0 remaining
info VoyagerBot Sync state: ERROR -> ERROR
ERR! VoyagerBot { error:
ERR! VoyagerBot { [ORG.MATRIX.JSSDK_TIMEOUT: Locally timed out waiting for a response]
ERR! VoyagerBot errcode: 'ORG.MATRIX.JSSDK_TIMEOUT',
ERR! VoyagerBot name: 'ORG.MATRIX.JSSDK_TIMEOUT',
ERR! VoyagerBot message: 'Locally timed out waiting for a response',
ERR! VoyagerBot data:
ERR! VoyagerBot { error: 'Locally timed out waiting for a response',
ERR! VoyagerBot errcode: 'ORG.MATRIX.JSSDK_TIMEOUT',
ERR! VoyagerBot timeout: 15000 } } }
```
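The `JSSDK_TIMEOUT` errors above show the bot locally timing out (15s) and re-entering the ERROR state on every retry. A keep-alive loop for this situation would normally space out its reconnect attempts with capped exponential backoff rather than hammering the struggling homeserver. This is only an illustrative sketch, not voyager's actual implementation; the function name and constants are assumptions:

```javascript
// Hypothetical backoff schedule for a sync keep-alive loop.
// attempt 0 -> 1s, attempt 1 -> 2s, attempt 2 -> 4s, ... capped at 60s,
// so repeated ORG.MATRIX.JSSDK_TIMEOUT errors don't turn into a tight retry loop.
function backoffMs(attempt, baseMs = 1000, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

With a schedule like this, the bot would wait roughly a minute between attempts once the homeserver has been failing for a while, giving it room to recover.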
This is voyager's monitoring:
This is the homeserver's basic monitoring:
Prometheus metrics are not available for this incident. However, voyager did run up against the resource limits of its Toronto node:
After voyager recovered, it queued updates for ~105,359 nodes (it tries to cache member information). Processing these took quite a while and generated significant database traffic, monopolizing the connection pool.
Voyager is currently stable (as of this writing), though it is still working through ~80k node updates.
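A flush of ~105k queued node updates can starve the database connection pool if every update contends for a connection at once. One common mitigation is to drain the queue in fixed-size batches so each flush holds at most one connection at a time. A minimal sketch, assuming hypothetical `chunk`/`processBatch` helpers (these names are not voyager's API):

```javascript
// Split a large update queue into fixed-size batches.
function chunk(items, batchSize) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Drain the queue one batch at a time so a single flush cannot
// monopolize the database connection pool. processBatch is assumed
// to perform one bounded round of database writes per call.
async function processQueue(updates, processBatch, batchSize = 500) {
  for (const batch of chunk(updates, batchSize)) {
    await batch && processBatch(batch);
  }
}
```

Batching trades total flush latency for predictable, bounded pressure on the pool, which matters more here than raw throughput.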