matrix-org / dendrite

Dendrite is a second-generation Matrix homeserver written in Go!
https://matrix-org.github.io/dendrite/
Apache License 2.0
5.67k stars, 664 forks

Crashes When Joining Large Room (Memory Usage Loading History) #1939

Closed SentToDevNull closed 2 years ago

SentToDevNull commented 3 years ago

Description

I'm using the latest commit as of writing this. (I have experienced the same issue with the latest "release" as well.)

When I launch my server, it seems to work well until I try to join a room on another server with too much history.

Whenever I join a room on another server that has a lot of messages, the join hangs, seemingly forever, while memory usage climbs until I have to power off my server (because there is no longer enough memory for me to spawn a shell and kill the process).

Edit: Sometimes, but not always, after it reaches this point, my scheduler appears to kill the dendrite server and I don't have to reboot.

Steps to reproduce

I'm not doing anything too complicated:

cd dendrite
./bin/dendrite-monolith-server --tls-cert /etc/letsencrypt/live/my_site.com/fullchain.pem --tls-key /etc/letsencrypt/live/my_site.com/privkey.pem --config dendrite.yaml

Here is my dendrite.yaml file.

The console output when I attempt to join rooms on other servers is a seemingly never-ending stream of INFO statements like the following:

INFO[2021-07-26T00:00:48.443092992Z] [send.go:206] Send
   Received transaction "1627243433681" from "techlore.net" containing 0 PDUs, 1 EDUs  req.id=7qSuTVlA8RCg req.method=PUT req.path=/_matrix/federation/v1/send/1627243433681
INFO[2021-07-26T00:00:49.138694805Z] [send.go:206] Send
   Received transaction "1627243433905" from "techlore.net" containing 0 PDUs, 1 EDUs  req.id=AKG6OVWhIwT3 req.method=PUT req.path=/_matrix/federation/v1/send/1627243433905

I think dendrite is trying to load all history since the beginning of time for every room on every server I join. It just keeps receiving transactions for messages until I run out of memory. One would expect that by default it would only load maybe a couple hundred messages or so, or at least that it would decide to stop loading history after it eats all available memory.

SentToDevNull commented 3 years ago

Update: Even killing and restarting the server doesn't stop the process. Once dendrite is launched again, it continues trying to fetch history forever.

(The only solution I've found that helps is killing the process, then completely wiping all databases, restarting, and remembering not to join rooms on other servers.)

SentToDevNull commented 3 years ago

Occasionally, I also get events like the following written to stdout:

WARN[2021-07-26T02:11:39.349828028Z] [send.go:280] processTransaction
   Transaction: Failed to query room version for room!JlytvgrOTrGPXOfjrK:techlore.net  error="QueryRoomVersionForRoom: missing room info for room !JlytvgrOTrGPXOfjrK:techlore.net" req.id=uDO5hPlLj5Ll req.method=PUT req.path=/_matrix/federation/v1/send/1627042800413

(Also, this is not just an issue faced when connecting to the techlore.net server. All matrix.org servers I've tried behave the same way.)

SentToDevNull commented 3 years ago

When creating rooms, there is an option to limit a user such that they only see messages created after joining the room.

When entering public rooms on other servers that allow new users to see all past history, is it possible for the user on a dendrite server to choose only to see messages generated after they join? (That could be a useful workaround.)

SentToDevNull commented 3 years ago

More Context:

This seems to be related to the backfill API: https://spec.matrix.org/unstable/server-server-api/#backfilling-and-retrieving-missing-events.

It appears that the limit for backfilling messages when joining a room is set in federationapi/routing/backfill.go#62 by the server itself. Is there an option somewhere in the codebase where the user can decide what limit to use when backfilling after joining a room on another server? If not, that seems like a very necessary feature.

neilalexander commented 3 years ago

How much RAM does the system have?

SentToDevNull commented 3 years ago

1 GiB memory + 1 GiB swap

SentToDevNull commented 3 years ago

I was able to get federation to work properly (and am now able to join rooms) by hard-coding the backfill limit used when joining a room in roomserver/internal/perform/perform_backfill.go#L474:

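// Original call, which passes the caller-supplied limit through:
// tx, err := b.fsAPI.Backfill(ctx, server, roomID, limit, fromEventIDs)
// Workaround: hard-code the limit to 100.
tx, err := b.fsAPI.Backfill(ctx, server, roomID, 100, fromEventIDs)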

There should be a better way to do this though. Perhaps we should add some logic that looks for an option called backfill_limit_override in dendrite.yaml and sets the limit parameter to the lower of "backfill_limit_override" and "limit retrieved from server".
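A minimal sketch of that "lower of the two" logic, for illustration only. The function name effectiveBackfillLimit and the idea of treating a non-positive override as "not configured" are assumptions; backfill_limit_override is the option proposed above and is not an existing Dendrite setting, and reading it from dendrite.yaml is left out here.

```go
package main

import "fmt"

// effectiveBackfillLimit returns the limit to pass to the backfill call.
// requested is the limit the code would otherwise use; override is a
// hypothetical operator-configured cap (the proposed backfill_limit_override).
// An override <= 0 is treated as "not set" (an assumption of this sketch),
// in which case the requested limit is returned unchanged.
func effectiveBackfillLimit(requested, override int) int {
	if override > 0 && override < requested {
		return override
	}
	return requested
}

func main() {
	fmt.Println(effectiveBackfillLimit(5000, 100)) // override is lower, so it wins
	fmt.Println(effectiveBackfillLimit(50, 100))   // requested is already below the cap
	fmt.Println(effectiveBackfillLimit(5000, 0))   // no override configured
}
```

With something like this in place, the hard-coded 100 in the call above could be replaced by effectiveBackfillLimit(limit, override), where override would come from the configuration.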

SentToDevNull commented 3 years ago

Update: Hard-coding an initial backfill limit upon room joining works only for servers I have not tried to join before. Dendrite still crashes sometimes, but at least I can restart it and join rooms on those new servers.

I am unable to join servers that I tried connecting to before adding the hard-coded backfill limit. Even after wiping my databases and generating a new Matrix private key, and regardless of whether I try to join them again, those servers spam me with backfill events that cause my server to crash.

SentToDevNull commented 3 years ago

Though limiting the initial backfill when joining a room works, when I try to load more events (scrolling up in my Matrix client seems to request ~50 events prior to the last one I have), my server crashes after a while (without any error messages). When I restart dendrite, I am able to immediately see all those events that I just requested.

bones-was-here commented 2 years ago

I am unable to join #dendrite:matrix.org using dendrite 0.5.0 :)

It hasn't crashed or leaked a noticeable amount of memory, but this is on a far more powerful system than @SentToDevNull is using: 32 GB RAM, a Postgres backend on NVMe, and the monolithic binary compiled with Go 1.15.9 on Debian 11.

In Element I can see the list of people, but the chat history is an endless (I gave up after about an hour) spinner. The Dendrite stdout log shows a seemingly unlimited history retrieval, with heavy usage of several CPU cores. I guess this is the same issue, just without the crash because the system doesn't run out of resources.

jackvandrunen commented 2 years ago

@bones-was-here hey, I'm looking at this issue because this has been happening to me yesterday and today as well! Almost exactly the same problem, down to the server specs even, though I suspected that my slow internet was the culprit.

bones-was-here commented 2 years ago

Nah it's not caused by slow internet, the server I'm using is in a datacentre in Germany with several Gbps connection speed.

kegsay commented 2 years ago

When you join a big room there's a lot of state to check. This will cause a memory spike. I think @neilalexander has resolved this with his optimisation work over the past few months?

neilalexander commented 2 years ago

The situation is certainly better than it was, but there are still unavoidable memory spikes when joining particularly big rooms. I'd be surprised if those spikes went much bigger than a few hundred MB though, unless the auth chain is exceptionally large.

kegsay commented 2 years ago

Closing this for now then.