S7evinK opened 2 years ago
This happens because Dendrite hasn't yet blacklisted many of those servers. Attempting to send data to those servers causes high load.
The federation API creates one goroutine per destination — so in this case, 888 goroutines for the 888 destinations. That does create a spike as each destination queue wakes up, checks the database for things to send and then opens federation connections. We probably want to run a profile at some point to find out exactly which part of the process is the most expensive, as I can quite believe that it's the database operations that are using the most CPU time.
We see similar spikes on dendrite.matrix.org and similar deployments, so we might want to come up with a way of limiting the number of goroutines created for outbound federation in general. I suspect that may mean transactions to some servers take longer to send if they end up queued behind others.
A worker pool model may be better here, e.g. hash(server_name) mod N for N workers. The workers could either exist permanently or be created and killed on demand. The former is simpler, but the goroutines then sit around forever — which may not be a problem, as parked goroutines aren't particularly expensive?
More of a reminder: possibly related, since this can degrade QoS - #1622 and maybe #2079
Background information

go version: 1.17.x

Description
Steps to reproduce
The logs show many entries like the following:
Disabling "Send typing notifications" in Element Web helps in this case, but features like read markers could probably result in the same behavior on busy servers/rooms.