S7evinK opened 2 years ago
This happens because Dendrite hasn't yet blacklisted many of those servers. Attempting to send data to those servers causes high load.
The federation API creates one goroutine per destination — so in this case, 888 goroutines for the 888 destinations. That does create a spike as each destination queue wakes up, checks the database for things to send and then opens federation connections. We probably want to run a profile at some point to find out exactly which part of the process is the most expensive, as I can quite believe that it's the database operations that are using the most CPU time.
We see similar spikes on dendrite.matrix.org and similar deployments, so we might want to come up with a way of limiting the number of goroutines created for outbound federation in general. I suspect that may mean transactions to some servers take longer to send if they end up queued behind others.
A worker pool model may be better here, e.g. hash(server_name) mod N for N workers. The workers could either exist permanently or be created and killed on demand. The former is simpler, but the goroutines then sit around forever — which may not be a problem, as parked goroutines aren't particularly expensive?
More of a reminder: possibly related, since this can degrade QoS - #1622 and maybe #2079
Background information

go version: 1.17.x

Description
Steps to reproduce
The logs show many entries like the following:
Disabling "Send typing notifications" in Element Web helps in this case, but features like read markers could probably result in the same behavior on busy servers/rooms.