matrix-org / dendrite

Dendrite is a second-generation Matrix homeserver written in Go!
https://matrix-org.github.io/dendrite/
Apache License 2.0

Appservices/Bridges do not get events for random amounts of time #2677

Closed MaximilianGaedig closed 2 years ago

MaximilianGaedig commented 2 years ago

Background information

Description

Steps to reproduce

Unfortunately, after thorough investigation I could not find the cause or a fix for this. Everything looked normal: the logs, the database, the logs of the appservices/bridges. Restarting does not fix the issue; I restarted every Docker container I had and it still persisted. This also happened to another person in the Matrix Dendrite chat, so it doesn't look like a faulty setup to me but rather an issue with 0.9.5. Downgrading to 0.9.4 did not help, though.

On request I can provide full logs, but according to my investigation there is nothing to see in them.

Edit: Thanks to @mjohnson9 for coming up with a workaround (deleting the jetstream directory)

mjohnson9 commented 2 years ago

I'm experiencing the same issue. I have a nearly identical setup, but on Kubernetes. I'm using Cinny instead of Element, but I don't think that's relevant to this issue.

According to traffic graphs and logs, it seems that the bridges stopped receiving traffic at about 04:57 UTC on August 26th. I don't have any alerts from Prometheus indicating that I experienced any underlying faults at or around this time, nor was there any automated or manual maintenance activity happening. I can't find any events to correlate this to.

Here are the last 2 minutes of logs before the traffic to the bridges stopped. I may have been a bit overzealous with redactions, because I don't know what all of the values represent.

time="2022-08-26T04:55:51.873200708Z" level=info msg=PerformPusherCreation display_name="Element (iOS)" localpart=[REDACTED - ref: 18] pushkey="[REDACTED - ref: 4]" req.id=[REDACTED - ref: 0] req.method=POST req.path=/_matrix/client/r0/pushers/set user_id="@[REDACTED - ref: 18]:johnson.gg"
time="2022-08-26T04:55:57.690995413Z" level=info msg="Uploading file" ContentType=image/jpeg FileSizeBytes=288124 Origin=johnson.gg UploadName=[REDACTED - ref: 2] req.id=[REDACTED - ref: 3] req.method=POST req.path=/_matrix/media/r0/upload user_id="@[REDACTED - ref: 18]:johnson.gg"
time="2022-08-26T04:55:57.730133754Z" level=info msg="File uploaded" Base64Hash=[REDACTED - ref: 5] ContentType=image/jpeg FileSizeBytes=288124 Origin=johnson.gg UploadName=[REDACTED - ref: 2] media_id=[REDACTED - ref: 4] req.id=[REDACTED - ref: 1] req.method=POST req.path=/_matrix/media/r0/upload user_id="@[REDACTED - ref: 18]:johnson.gg"
time="2022-08-26T04:55:57.861480910Z" level=info msg="Generated thumbnail" ActualHeight=32 ActualWidth=32 Height=32 Origin=johnson.gg ResizeMethod=crop Width=32 media_id=[REDACTED - ref: 4] processTime=13.699102ms req.id=[REDACTED - ref: 1] req.method=POST req.path=/_matrix/media/r0/upload user_id="@[REDACTED - ref: 18]:johnson.gg"
time="2022-08-26T04:55:57.933846065Z" level=info msg="Signalling other goroutines waiting for this goroutine to generate the thumbnail." Height=32 Origin=johnson.gg ResizeMethod=crop Width=32 media_id=[REDACTED - ref: 4] req.id=[REDACTED - ref: 1] req.method=POST req.path=/_matrix/media/r0/upload user_id="@[REDACTED - ref: 18]:johnson.gg"
time="2022-08-26T04:55:57.977970458Z" level=info msg="Generated thumbnail" ActualHeight=96 ActualWidth=96 Height=96 Origin=johnson.gg ResizeMethod=crop Width=96 media_id=[REDACTED - ref: 4] processTime=43.737987ms req.id=[REDACTED - ref: 1] req.method=POST req.path=/_matrix/media/r0/upload user_id="@[REDACTED - ref: 18]:johnson.gg"
time="2022-08-26T04:55:57.981359692Z" level=info msg="Generated thumbnail" ActualHeight=720 ActualWidth=720 Height=720 MediaID=[REDACTED - ref: 4] Origin=johnson.gg ResizeMethod=scale Width=960 processTime=77.127208ms req.id=[REDACTED - ref: 6] req.method=GET req.path=/_matrix/media/r0/thumbnail/johnson.gg/[REDACTED - ref: 4]
time="2022-08-26T04:55:58.045096254Z" level=info msg="Signalling other goroutines waiting for this goroutine to generate the thumbnail." Height=96 Origin=johnson.gg ResizeMethod=crop Width=96 media_id=[REDACTED - ref: 4] req.id=[REDACTED - ref: 1] req.method=POST req.path=/_matrix/media/r0/upload user_id="@[REDACTED - ref: 18]:johnson.gg"
time="2022-08-26T04:55:58.045208049Z" level=info msg="Signalling other goroutines waiting for this goroutine to generate the thumbnail." Height=720 MediaID=[REDACTED - ref: 4] Origin=johnson.gg ResizeMethod=scale Width=960 req.id=[REDACTED - ref: 6] req.method=GET req.path=/_matrix/media/r0/thumbnail/johnson.gg/[REDACTED - ref: 4]
time="2022-08-26T04:55:58.076225800Z" level=info msg="Generated thumbnail" ActualHeight=480 ActualWidth=480 Height=480 Origin=johnson.gg ResizeMethod=scale Width=640 media_id=[REDACTED - ref: 4] processTime=30.890368ms req.id=[REDACTED - ref: 1] req.method=POST req.path=/_matrix/media/r0/upload user_id="@[REDACTED - ref: 18]:johnson.gg"
time="2022-08-26T04:55:58.094598595Z" level=info msg="Signalling other goroutines waiting for this goroutine to generate the thumbnail." Height=480 Origin=johnson.gg ResizeMethod=scale Width=640 media_id=[REDACTED - ref: 4] req.id=[REDACTED - ref: 1] req.method=POST req.path=/_matrix/media/r0/upload user_id="@[REDACTED - ref: 18]:johnson.gg"
time="2022-08-26T04:55:58.241429640Z" level=info msg="Sent event to roomserver" event_id="$[REDACTED - ref: 12]" req.id=[REDACTED - ref: 7] req.method=PUT req.path="/_matrix/client/r0/rooms/![REDACTED - ref: 8]:johnson.gg/send/m.room.message/[REDACTED - ref: 9]" room_id="![REDACTED - ref: 8]:johnson.gg" room_version=9 user_id="@[REDACTED - ref: 18]:johnson.gg"
time="2022-08-26T04:56:02.463263374Z" level=info msg="Sent event to roomserver" event_id="$[REDACTED - ref: 13]" req.id=[REDACTED - ref: 15] req.method=PUT req.path="/_matrix/client/r0/rooms/![REDACTED - ref: 8]:johnson.gg/send/m.room.message/[REDACTED - ref: 10]" room_id="![REDACTED - ref: 8]:johnson.gg" room_version=9 user_id="@[REDACTED - ref: 18]:johnson.gg"
time="2022-08-26T04:56:09.343720319Z" level=info msg="Sent event to roomserver" event_id="$[REDACTED - ref: 14]" req.id=[REDACTED - ref: 16] req.method=PUT req.path="/_matrix/client/v3/rooms/![REDACTED - ref: 8]:johnson.gg/send/m.room.message/[REDACTED - ref: 11]" room_id="![REDACTED - ref: 8]:johnson.gg" room_version=9 user_id="@[REDACTED - ref: 17]:johnson.gg"
time="2022-08-26T04:56:24.909898110Z" level=info msg="Sending EDU event" destinations=4115 edu_type=m.presence
mjohnson9 commented 2 years ago

I just noticed on startup that I have 24k messages in the NATS stream DendriteOutputRoomEvent. I'm wondering if this is related.

matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.196150338-04:00 [1] [INF]   Starting restore for stream '$G > DendriteInputDeviceListUpdate'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.196704270-04:00 [1] [INF]   Restored 0 messages for stream '$G > DendriteInputDeviceListUpdate'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.196806556-04:00 [1] [INF]   Starting restore for stream '$G > DendriteInputRoomEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.197088895-04:00 [1] [INF]   Restored 0 messages for stream '$G > DendriteInputRoomEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.197234434-04:00 [1] [INF]   Starting restore for stream '$G > DendriteOutputClientData'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.197467808-04:00 [1] [INF]   Restored 0 messages for stream '$G > DendriteOutputClientData'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.197589382-04:00 [1] [INF]   Starting restore for stream '$G > DendriteOutputKeyChangeEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.197880793-04:00 [1] [INF]   Restored 0 messages for stream '$G > DendriteOutputKeyChangeEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.197989405-04:00 [1] [INF]   Starting restore for stream '$G > DendriteOutputNotificationData'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.198225014-04:00 [1] [INF]   Restored 0 messages for stream '$G > DendriteOutputNotificationData'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.198352900-04:00 [1] [INF]   Starting restore for stream '$G > DendriteOutputReadUpdate'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.198589797-04:00 [1] [INF]   Restored 0 messages for stream '$G > DendriteOutputReadUpdate'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.198704312-04:00 [1] [INF]   Starting restore for stream '$G > DendriteOutputReceiptEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.198941575-04:00 [1] [INF]   Restored 0 messages for stream '$G > DendriteOutputReceiptEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.199039065-04:00 [1] [INF]   Starting restore for stream '$G > DendriteOutputRoomEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.201919294-04:00 [1] [INF]   Restored 24,245 messages for stream '$G > DendriteOutputRoomEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.202036301-04:00 [1] [INF]   Starting restore for stream '$G > DendriteOutputSendToDeviceEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.202275850-04:00 [1] [INF]   Restored 0 messages for stream '$G > DendriteOutputSendToDeviceEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.202366673-04:00 [1] [INF]   Starting restore for stream '$G > DendriteOutputStreamEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.202610305-04:00 [1] [INF]   Restored 0 messages for stream '$G > DendriteOutputStreamEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.202628023-04:00 [1] [INF]   Recovering 1 consumers for stream - '$G > DendriteInputDeviceListUpdate'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.203177186-04:00 [1] [INF]   Recovering 58 consumers for stream - '$G > DendriteInputRoomEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.212331555-04:00 [1] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputClientData'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.212543662-04:00 [1] [INF]   Recovering 2 consumers for stream - '$G > DendriteOutputKeyChangeEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.212945154-04:00 [1] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputNotificationData'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.213120337-04:00 [1] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputReadUpdate'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.213313617-04:00 [1] [INF]   Recovering 2 consumers for stream - '$G > DendriteOutputReceiptEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.213677839-04:00 [1] [INF]   Recovering 3 consumers for stream - '$G > DendriteOutputRoomEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.214201593-04:00 [1] [INF]   Recovering 2 consumers for stream - '$G > DendriteOutputSendToDeviceEvent'
matrix-dendrite-5684bbcd45-tchx7 matrix-dendrite 2022-08-26T13:13:35.214511698-04:00 [1] [INF]   Recovering 1 consumers for stream - '$G > DendriteOutputStreamEvent'
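If Dendrite is pointed at an external NATS server (the embedded one it runs by default is not exposed), something like the following nats.go sketch can show whether that stream keeps growing; the URL is just a placeholder for your own setup.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder address: this only works if Dendrite is configured to use
	// an external NATS server, since the embedded one is not reachable.
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Report how many messages are sitting in the stream that the
	// appservice component (among others) consumes from.
	info, err := js.StreamInfo("DendriteOutputRoomEvent")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("DendriteOutputRoomEvent: %d messages, %d consumers\n",
		info.State.Msgs, info.State.Consumers)
}
```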
mjohnson9 commented 2 years ago

More digging, and I should've checked this earlier: Dendrite is causing my Postgres database to be 100% busy. With my modest hardware, that's only ~250 QPS. The failure might correspond to when I joined #matrix:matrix.org. Most of the busy time is spent on the roomserver_ tables.

Perhaps it's doing some kind of room state reconciliation, and that's why Dendrite isn't processing the appservice's room events?
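For anyone digging into the same thing, this is roughly how to see what is keeping Postgres busy: list the statements currently running against the Dendrite database. The connection string below is a placeholder for your own setup.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// Placeholder DSN; point it at your Dendrite database.
	db, err := sql.Open("postgres",
		"postgres://dendrite:secret@localhost/dendrite?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Show the queries that are active right now; in my case these were
	// mostly hitting the roomserver_* tables.
	rows, err := db.Query(`SELECT pid, query FROM pg_stat_activity
		WHERE datname = current_database() AND state = 'active'`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var pid int
		var query string
		if err := rows.Scan(&pid, &query); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%d: %s\n", pid, query)
	}
}
```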

mjohnson9 commented 2 years ago

Ended up resolving for now by deleting the contents of the jetstream directory

MaximilianGaedig commented 2 years ago

> Ended up resolving for now by deleting the contents of the jetstream directory

Ah, so a problem with JetStream. Thank you for coming up with a workaround, but we need a proper fix. Also, how critical is the data in JetStream? Will I destroy messages or something like that?

mjohnson9 commented 2 years ago

You'll destroy everything in the queue waiting to be delivered, I think. And it ended up being only a very temporary fix for me; it started happening again about 7 hours after I cleared the jetstream directory.

MaximilianGaedig commented 2 years ago

For me it occurred again, from 20 minutes ago until now. I cleared the jetstream directory and restarted Dendrite, and saw a transaction getting sent to my Signal mautrix bridge, but it did not send anything on Signal, which is weird (maybe because the queued event was destroyed).

S7evinK commented 2 years ago

I've just pushed a commit to main that logs an error which can cause the DendriteOutputRoomEvent stream to build up that many events. The docker images will be ready in ~1h I think.

neilalexander commented 2 years ago

FWIW I am refactoring the appservice component completely in #2687 which should improve things quite a bit. It would be great if you can test that branch to see if the problem still persists.

mjohnson9 commented 2 years ago

I ended up swapping back to Synapse temporarily. I can't commit to a timeframe, but I'll give that branch a try as soon as I have the time. :)

MaximilianGaedig commented 2 years ago

> FWIW I am refactoring the appservice component completely in #2687 which should improve things quite a bit. It would be great if you can test that branch to see if the problem still persists.

The problem is that it appears randomly; I haven't had it for the last 2 days, or I missed it. But I can test out the branch if there is a Docker image.

MaximilianGaedig commented 2 years ago

It happened again today for the last 40 minutes; my last relayed message was 12 hours ago. I am running Dendrite 0.9.6, updated 8 hours ago.

MaximilianGaedig commented 2 years ago

I got some interesting log messages though!

time="2022-09-02T20:37:42.208912801Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=telegram error="received HTTP status code 413 from appservice"
time="2022-09-02T20:37:42.259461953Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=signal error="received HTTP status code 413 from appservice"
time="2022-09-02T20:37:42.474682190Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=facebook error="received HTTP status code 413 from appservice"
time="2022-09-02T20:38:49.322307813Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=telegram error="received HTTP status code 413 from appservice"
time="2022-09-02T20:38:49.459165112Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=facebook error="received HTTP status code 413 from appservice"
time="2022-09-02T20:38:49.547374130Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=signal error="received HTTP status code 413 from appservice"
time="2022-09-02T20:39:57.389215036Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=signal error="received HTTP status code 413 from appservice"
time="2022-09-02T20:39:57.411556268Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=telegram error="received HTTP status code 413 from appservice"
time="2022-09-02T20:39:57.449653828Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=facebook error="received HTTP status code 413 from appservice"
time="2022-09-02T20:41:04.692971167Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=signal error="received HTTP status code 413 from appservice"
time="2022-09-02T20:41:04.809044067Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=telegram error="received HTTP status code 413 from appservice"
time="2022-09-02T20:41:04.858853452Z" level=error msg="Unable to send transaction to appservice, backing off for 1m4s" appservice=facebook error="received HTTP status code 413 from appservice"

Maybe this:

> I've just pushed a commit to main that logs an error which can cause the DendriteOutputRoomEvent stream to build up that many events. The docker images will be ready in ~1h I think.

added those log messages, or something else changed, because these are new.

MaximilianGaedig commented 2 years ago

So this seems to be a combined issue of Dendrite sending too-large transactions and/or the mautrix bridges having a 1 MB limit; after setting appservice.max_body_size to 100 (MB), messages started coming in. This should be investigated further, though, because this apparently does not happen on Synapse. Maybe this JetStream cache, or whatever it is, stores messages and then bursts them all at once, which pushes the request over 1 MB.
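For context on where the 413 comes from: the bridge rejects any transaction whose body exceeds its configured limit before processing it, so the homeserver keeps backing off and retrying the same oversized batch. Below is a minimal Go sketch of that behaviour (purely illustrative, not the bridges' actual code; the listen address is made up).

```go
package main

import (
	"errors"
	"io"
	"log"
	"net/http"
)

// 1 MiB, roughly the default cap the bridges appear to use
// (appservice.max_body_size, configured in MB).
const maxBodySize = 1 << 20

func handleTransaction(w http.ResponseWriter, r *http.Request) {
	// Cap the request body; reading past the limit yields *http.MaxBytesError.
	r.Body = http.MaxBytesReader(w, r.Body, maxBodySize)

	body, err := io.ReadAll(r.Body)
	if err != nil {
		var mbe *http.MaxBytesError
		if errors.As(err, &mbe) {
			// This is the 413 that Dendrite logs and backs off from.
			http.Error(w, "transaction too large", http.StatusRequestEntityTooLarge)
			return
		}
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	_ = body // a real bridge would decode and handle the bridged events here
	w.WriteHeader(http.StatusOK)
}

func main() {
	// The appservice transaction endpoint from the Application Service API;
	// the port is a placeholder.
	http.HandleFunc("/_matrix/app/v1/transactions/", handleTransaction)
	log.Fatal(http.ListenAndServe(":29328", nil))
}
```

If the queued events really are being flushed as one large transaction, raising the bridge-side limit only works around the burst; the underlying cause would still be on the Dendrite side.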

S7evinK commented 2 years ago

The rewrite of the appservice component contained a bug which would cause it to send too many events to the appservice. This is fixed on main now and should hopefully fix the HTTP 413 from the AS. (It might also fix the issue regarding request size.)

MaximilianGaedig commented 2 years ago

I will keep this issue open for a couple of days; if it doesn't reoccur, I'll close it :)

MaximilianGaedig commented 2 years ago

This did not reoccur, so to conclude: fixed. Worst case, try setting the mautrix bridges' appservice.max_body_size to 100 (MB) if it is still happening.