matrix-org / dendrite

Dendrite is a second-generation Matrix homeserver written in Go!
https://matrix-org.github.io/dendrite/
Apache License 2.0

Crash on startup when handling keyserver_stale_device_lists #1511

Closed by jhulkko 3 years ago

jhulkko commented 3 years ago

Description

Dendrite always crashes during startup because of some content in the keyserver_stale_device_lists table. Not all list entries seem to cause this, but whenever the server fails to start, flushing the list or deleting the table always resolves the issue.

Steps to reproduce

No definite way to reproduce as the root cause is unknown.

Details

Log entries of the event:

Oct 10 22:24:56 raspberrypi systemd[1]: Started dendrite matrix homeserver.
Oct 10 22:24:56 raspberrypi env[7760]: time="2020-10-10T19:24:56.128421959Z" level=info msg="Dendrite version 0.1.0" func="NewBaseDendrite\n\t" file=" [/opt/dendrite/internal/setup/base.go:102]"
Oct 10 22:24:56 raspberrypi env[7760]: time="2020-10-10T19:24:56.265032082Z" level=info msg="Enabled perspective key fetcher" func="NewInternalAPI\n\t" file=" [/opt/dendrite/signingkeyserver/signingkeyserver.go:103]" num_public_keys=2 se
Oct 10 22:24:56 raspberrypi env[7760]: panic: runtime error: index out of range [-2]
Oct 10 22:24:56 raspberrypi env[7760]: goroutine 67 [running]:
Oct 10 22:24:56 raspberrypi env[7760]: github.com/matrix-org/dendrite/keyserver/internal.(*DeviceListUpdater).notifyWorkers(0x1e10840, 0x1e15760, 0xf)
Oct 10 22:24:56 raspberrypi env[7760]:         /opt/dendrite/keyserver/internal/device_list_update.go:251 +0x210
Oct 10 22:24:56 raspberrypi env[7760]: github.com/matrix-org/dendrite/keyserver/internal.(*DeviceListUpdater).Start(0x1e10840, 0x0, 0x0)
Oct 10 22:24:56 raspberrypi env[7760]:         /opt/dendrite/keyserver/internal/device_list_update.go:138 +0x12c
Oct 10 22:24:56 raspberrypi env[7760]: github.com/matrix-org/dendrite/keyserver.NewInternalAPI.func1(0x1e10840)
Oct 10 22:24:56 raspberrypi env[7760]:         /opt/dendrite/keyserver/keyserver.go:52 +0x1c
Oct 10 22:24:56 raspberrypi env[7760]: created by github.com/matrix-org/dendrite/keyserver.NewInternalAPI
Oct 10 22:24:56 raspberrypi env[7760]:         /opt/dendrite/keyserver/keyserver.go:51 +0x344
Oct 10 22:24:56 raspberrypi systemd[1]: dendrite.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 10 22:24:56 raspberrypi systemd[1]: dendrite.service: Failed with result 'exit-code'.
Oct 10 22:24:57 raspberrypi systemd[1]: Stopped dendrite matrix homeserver.

Last log entry from PostgreSQL while trying to start:

2020-10-11 15:36:01.678 EEST [1137] dendrite@dendrite LOG:  execute 15: SELECT user_id FROM keyserver_stale_device_lists WHERE is_stale = $1
2020-10-11 15:36:01.678 EEST [1137] dendrite@dendrite DETAIL:  parameters: $1 = 't'
2020-10-11 15:36:01.686 EEST [1135] dendrite@dendrite LOG:  could not receive data from client: Connection reset by peer

The query returned a list of 11 user handles. Nothing out of the ordinary about them.

This time I renamed the table to backup_keyserver_stale_device_lists, causing Dendrite to re-create the original table on the next start. This resolved the issue.

I restarted the server process multiple times while experimenting with config changes, and the stale device list kept growing in between. The same issue, with the same errors, hit again after a few hours. This time I just removed the rows from the table:

dendrite=# delete from keyserver_stale_device_lists WHERE is_stale = 't';
DELETE 10

Afterwards the server started normally again.

If or when this happens again, I will remove the rows one by one from the database to see whether a specific user triggers the issue.

Lesterpig commented 3 years ago

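This bug is due to an integer overflow when the hash value is converted to int. On a 32-bit system, a uint32 sum with its top bit set wraps to a negative int, and Go's % operator keeps the sign of the dividend, so the computed index goes negative. It should only affect 32-bit systems.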
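
For anyone who wants to see the wraparound without a 32-bit board, here is a minimal sketch. It uses int32 to stand in for a 32-bit int so it behaves the same on any platform. The helper name index32 and the worker count of 8 are assumptions for illustration only; the real number of workers is not stated in this thread.

```go
package main

import "fmt"

// index32 mimics the original expression `int(hash.Sum32()) % len(u.workerChans)`
// as it behaves on a 32-bit build, where int is 32 bits wide.
// int32 stands in for int so the result does not depend on the host platform.
// (Hypothetical helper, not part of Dendrite.)
func index32(sum uint32, workers int32) int32 {
	return int32(sum) % workers
}

func main() {
	// Top bit set: the conversion wraps to -2, and % keeps the sign.
	fmt.Println(index32(0xFFFFFFFE, 8)) // -2

	// Top bit clear: the conversion is harmless.
	fmt.Println(index32(0x7FFFFFFE, 8)) // 6
}
```

Roughly half of all possible hash values have the top bit set, which would fit the report that only some entries in the table trigger the crash.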

Proposed patch:

diff --git a/keyserver/internal/device_list_update.go b/keyserver/internal/device_list_update.go
index 4d1b1107..4f802293 100644
--- a/keyserver/internal/device_list_update.go
+++ b/keyserver/internal/device_list_update.go
@@ -245,7 +245,7 @@ func (u *DeviceListUpdater) notifyWorkers(userID string) {
        }
        hash := fnv.New32a()
        _, _ = hash.Write([]byte(remoteServer))
-       index := int(hash.Sum32()) % len(u.workerChans)
+       index := int(int64(hash.Sum32()) % int64(len(u.workerChans)))

        ch := u.assignChannel(userID)
        u.workerChans[index] <- remoteServer
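
As a rough check of the patched expression, the sketch below compares it with an alternative that does the modulo in unsigned arithmetic. The function names patchedIndex, unsignedIndex and workerFor, the worker count of 8, and the server name are all made up for illustration. Only the int64 expression comes from the patch; the unsigned variant is just another way to keep the dividend non-negative, and it assumes the worker count is positive.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// patchedIndex follows the patched expression: widen to int64 before the
// modulo, so the dividend is never negative even when int is 32 bits wide.
func patchedIndex(sum uint32, workers int) int {
	return int(int64(sum) % int64(workers))
}

// unsignedIndex is an alternative that is NOT from the patch: take the modulo
// in uint32 and convert afterwards. Assumes workers > 0.
func unsignedIndex(sum uint32, workers int) int {
	return int(sum % uint32(workers))
}

// workerFor hashes a server name with 32-bit FNV-1a, as the surrounding code
// in the diff does, and maps it to a worker slot. (Hypothetical helper.)
func workerFor(server string, workers int) int {
	h := fnv.New32a()
	_, _ = h.Write([]byte(server))
	return patchedIndex(h.Sum32(), workers)
}

func main() {
	fmt.Println(patchedIndex(0xFFFFFFFE, 8))  // 6
	fmt.Println(unsignedIndex(0xFFFFFFFE, 8)) // 6

	// The slot for a real name depends on its hash, but it is always in [0, 8).
	fmt.Println(workerFor("example.org", 8))
}
```

Both variants give the same slot for every hash value, so the choice between them is a matter of taste.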

lgtrombetta commented 3 years ago

I bumped into this bug while testing Dendrite on a Raspberry Pi with a 32-bit system. As far as I could tell, after a fresh install the server would start up normally and work for communication among local users. As soon as a user attempted to chat with someone on a federated server, a record would appear in the table mentioned by the OP and the server would crash. It would not start again until the offending record had been deleted.

The issue is gone after applying the patch suggested in the previous post and rebuilding the server. It now works properly and allows communication with external users.

jhulkko commented 3 years ago

I rebuilt the server with Dendrite version 0.2.1 and the proposed patch. Now everything seems to work as expected on a 32-bit Raspberry Pi system.