matrix-org / dendrite

Dendrite is a second-generation Matrix homeserver written in Go!
https://matrix-org.github.io/dendrite/
Apache License 2.0
5.67k stars 664 forks source link

Dendrite syncing broken after upgrade to 0.8.8 #2566

Open edocod1 opened 2 years ago

edocod1 commented 2 years ago

Background information

Description

Steps to reproduce

I would write to the dendrite support room, but with my homeserver not working it is very hard to do so. This is not a "pls fix" request but more of a "hey how do i troubleshoot this?"

edocod1 commented 2 years ago

When rebooting the server, it looks like a lot of messages are stuck in DendriteRoomOutputEvent


[23931] [INF] 
[23931] [INF] ---------------- JETSTREAM ----------------
[23931] [INF]   Max Memory:      23.00 GB
[23931] [INF]   Max Storage:     7.24 GB
[23931] [INF]   Store Directory: "jetstream"
[23931] [INF] -------------------------------------------
[23931] [INF]   Restored 0 messages for stream '$G > DendriteInputDeviceListUpdate'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteInputRoomEvent'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputClientData'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputKeyChangeEvent'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputNotificationData'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputReadUpdate'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputReceiptEvent'
[23931] [INF]   Restored 14,609 messages for stream '$G > DendriteOutputRoomEvent'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputSendToDeviceEvent'
[23931] [INF]   Restored 0 messages for stream '$G > DendriteOutputStreamEvent'
[23931] [INF]   Recovering 1 consumers for stream - '$G > DendriteInputDeviceListUpdate'
[23931] [INF]   Recovering 21 consumers for stream - '$G > DendriteInputRoomEvent'
[23931] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputClientData'
[23931] [INF]   Recovering 2 consumers for stream - '$G > DendriteOutputKeyChangeEvent'
[23931] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputNotificationData'
[23931] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputReadUpdate'
[23931] [INF]   Recovering 2 consumers for stream - '$G > DendriteOutputReceiptEvent'
[23931] [INF]   Recovering 2 consumers for stream - '$G > DendriteOutputRoomEvent'
[23931] [INF]   Recovering 2 consumers for stream - '$G > DendriteOutputSendToDeviceEvent'
[23931] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputStreamEvent'
[23931] [INF] Server is ready
spaetz commented 2 years ago

Have you tried moving the jetstream directory out of the way before a restart, (jetstream/\$G/streams/)? It sometimes seems to get stuck on an upgrade...

edocod1 commented 2 years ago

Yes, i have tried that with no success. But doing that unlocks some more log entries which wouldn't otherwise appear:

neilalexander commented 2 years ago

Can you please enable profiling and then take a goroutine profile after trying to send some messages into a room?

The profile URL will be like this: http://localhost:65432/debug/pprof/goroutine?debug=1

Notmarrco commented 2 years ago

It seems I had the same problem after a power shutdown. Removing the jetstream folder fixed it thanks @spaetz , but I lost all unbridged messages.

By the way, when using ls I don't have /jetstream/$G/streams but instead an endless list of subdirectories named $G/$G/$G... du shows the right tree.

edocod1 commented 2 years ago

@neilalexander profile here profile.txt

neilalexander commented 2 years ago

Can you please also capture a http://localhost:65432/debug/pprof/goroutine?debug=2 (that's debug=2 instead of debug=1)?

Ideally without restarting the process if you can, since that'll nuke possibly valuable information from the trace.

edocod1 commented 2 years ago

https://paste.debian.net/1246946/

Here it is

neilalexander commented 2 years ago

Right, my suspicion is that there's a problem with the database connection pool. It looks like Dendrite is stuck waiting for database connections, which probably explains why it isn't progressing.

Can you please tell me whether or not you have configured database connections separately per-component, or did you configure it once in the global.database section of the config?

What values do you have for max_open_conns, max_idle_conns and conn_max_lifetime?

edocod1 commented 2 years ago

I have configured database connections separately (though they all point to the same db)

Limits are these

# cat dendrite.yaml | grep max_open
# The "max_open_conns" and "max_idle_conns" settings configure the maximum
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
    max_open_conns: 10
root@dendrite:~/dendrite# cat dendrite.yaml | grep max_idle
# The "max_open_conns" and "max_idle_conns" settings configure the maximum
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
    max_idle_conns: 2
root@dendrite:~/dendrite# cat dendrite.yaml | grep conn_max_l
# "conn_max_lifetime" option controls the maximum length of time a database
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1
    conn_max_lifetime: -1
neilalexander commented 2 years ago

Hmm, that all looks reasonable... on a second look, the database connections are probably waiting normally (was looking at sql.go from the wrong Go version and line numbers differed).

Can I ask you to try the latest commit 55d81beab on the neilalexander/nats290 branch which updates the NATS Server to a newer version, just to rule that out?

edocod1 commented 2 years ago

Looks like nothing changed. After some time of running, i still have the exact number of messages in the jetstream queue, and everything is stuck as before.

nullchinchilla commented 2 years ago

I occasionally run into this error too. The classic symptom is that messages seem to send correctly in Element, but they never get to their destination. When Element is closed and reopened, "some recent messages were not sent" pops out.

edocod1 commented 2 years ago

It unlocked for some days, since yesterday it is locked again.

Using dendrite 0.9.1 now

Log is full of "Failed do manually update device lists for user"