turt2live opened 5 years ago
Additional comments from #synapse-dev:matrix.org https://matrix.to/#/!yZHTGeDKZUeKaqeTeU:matrix.org/$15577656511618KXcjl:t2l.io?via=matrix.org&via=sw1v.org&via=amorgan.xyz @turt2live
I have additional thoughts if it helps: EDUs are generally less important than messages, so processing them async to PDUs makes sense to me. In general, clients can handle receiving an encrypted event and then the applicable EDUs afterwards, so that isn't really a case to worry about. And there's no reason to block the entire request on the processing of EDUs, because the response doesn't really care.
@erikjohnston
So, two things:
* Some EDUs really do need to be reliable, e.g. device list updates, and if you process them async then they are no longer reliable.
* Returning before you're done processing means you'll get more EDUs to process before you've finished the last batch, which could easily cause issues in itself.
Based on the above we'd need to rethink how EDUs are used before it would be safe to process all of them in the background.
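To make the trade-off concrete, here is a minimal sketch (not Synapse's actual code; `handle_transaction`, `process_pdu`, `process_edu`, and the queue are all hypothetical names) of what "async to PDUs" could look like: PDUs stay on the request path so the response still reflects their outcome, while EDUs are pushed onto a queue drained by a background task. Erik's two concerns map directly onto it: anything sitting in the queue is lost if the process restarts unless it is persisted somewhere, and nothing here stops the queue growing faster than the worker can drain it.

```python
# Illustrative sketch only, assuming hypothetical process_pdu/process_edu helpers.
import asyncio
from dataclasses import dataclass, field


@dataclass
class Transaction:
    pdus: list = field(default_factory=list)
    edus: list = field(default_factory=list)


edu_queue: asyncio.Queue = asyncio.Queue()


async def handle_transaction(txn: Transaction) -> dict:
    results = {}
    for pdu in txn.pdus:
        # PDUs stay on the request path: the response reports per-event results.
        results[pdu["event_id"]] = await process_pdu(pdu)

    for edu in txn.edus:
        # EDUs are queued and the request returns without waiting on them.
        # Caveats from the discussion above: this sacrifices reliability for
        # EDUs such as device list updates, and the backlog can grow unboundedly.
        edu_queue.put_nowait(edu)

    return {"pdus": results}


async def edu_worker() -> None:
    # Background consumer, run as a long-lived task alongside the request handlers.
    while True:
        edu = await edu_queue.get()
        try:
            await process_edu(edu)
        finally:
            edu_queue.task_done()


async def process_pdu(pdu: dict) -> dict:
    return {}  # placeholder


async def process_edu(edu: dict) -> None:
    pass  # placeholder
```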
Author's note: Sorry about the short story here. The background information feels important to describe one possible use case for implementing this feature, even if it is rare.
The spec doesn't actually say that they should be handled in sync with the request (https://matrix.org/docs/spec/server_server/r0.1.1.html#put-matrix-federation-v1-send-txnid), which may be a bug in itself; however, it would be nice to have EDUs processed async to PDUs.
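For reference, the endpoint linked above carries both kinds of traffic in a single request body, which is why slow EDU handling can hold up PDU delivery. A rough illustration of the request shape (values are made up; field names follow the linked spec):

```python
# Illustrative PUT /_matrix/federation/v1/send/{txnId} body, values are examples only.
transaction_body = {
    "origin": "example.org",
    "origin_server_ts": 1557765651000,
    "pdus": [
        # room events (PDUs); the spec caps these at 50 per transaction
    ],
    "edus": [
        # ephemeral events (EDUs), capped at 100 per transaction, e.g. typing
        # notifications, receipts, or device list updates such as:
        {
            "edu_type": "m.device_list_update",
            "content": {
                "user_id": "@alice:example.org",
                "device_id": "ABCDEFG",
                "stream_id": 42,
            },
        },
    ],
}
```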
For a bit of background: t2bot.io has ignored each and every EDU for the last year for performance reasons. Nothing on the server used incoming EDUs for anything, so dealing with them was just wasted processing power.
Due to an unrelated problem, the server fell behind on federation and was catching back up. During that recovery period, I decided to re-enable EDU processing for encryption-related stuff for a future offering on the server (namely bots which are e2e-capable).
Device list updates then started pouring in, taking on average 300s each to process. This delayed PDUs even further, because replication between the federation reader and the main process would often time out on these device list updates. When the timeout occurred, the federation reader would 500 the transaction request and the sending server would re-send it.
Although it's likely my fault that device list updates took 300s to process (given they hadn't been touched for a year), it would be nice if EDUs could be handled async to PDUs so that PDUs can continue to flow through the server without being blocked by the less important EDUs (in my case they are less important, at least).
Also worth noting that CPU usage was fantastically small while running into 300s device list update processing times. Normally I'd expect some amount of torture on the system for request times like that, but instead it looked like traffic was back to small-homeserver levels: