matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Very slow state res in local rooms #8612

Closed f0x52 closed 3 years ago

f0x52 commented 4 years ago

Description

pixie.town runs a local IRC appservice, and for some reason some of those (larger) rooms routinely show up in the slow state-res DEBUG log, taking 100-200 seconds. pixie.town is the only homeserver in those rooms, however (verified with devtools), so there really shouldn't be much (any?) state to resolve?

14:44:10 synapse.state.metrics - 690 - DEBUG -  - 6 biggest rooms for state-res by DB time: ['!BRIDGED_NODEJS:pixie.town (131.23s)'
14:44:10 synapse.state.metrics - 690 - DEBUG -  - 6 biggest rooms for state-res by CPU time: ['!BRIDGED_NODEJS:pixie.town (18.7768s)'
14:42:10 synapse.state.metrics - 690 - DEBUG -  - 6 biggest rooms for state-res by DB time: ['!BRIDGED_NODEJS:pixie.town (158.061s)'
14:42:10 synapse.state.metrics - 690 - DEBUG -  - 6 biggest rooms for state-res by CPU time: ['!BRIDGED_NODEJS:pixie.town (21.1244s)'
14:40:10 synapse.state.metrics - 690 - DEBUG -  - 5 biggest rooms for state-res by DB time: ['!BRIDGED_NODEJS:pixie.town (215.818s)'
14:40:10 synapse.state.metrics - 690 - DEBUG -  - 5 biggest rooms for state-res by CPU time: ['!BRIDGED_NODEJS:pixie.town (49.4735s)'

Version information

If not matrix.org:

joepie91 commented 4 years ago

Additional information: the channel in question here is #Node.js on Freenode, which as of the time of writing contains 669 users. It seems to be a reproducible issue; every time I start having a (fast-paced) discussion with someone in the channel, things start slowing down and messages start taking minutes(!) to send.

anoadragon453 commented 4 years ago

State resolution is still required in rooms with only one homeserver, yes, and this can be especially resource-intensive in rooms that contain a lot of old extremities. Similar symptoms can often be found in IRC bridged rooms which contain thousands of membership events which often need to be walked back.

You can check the extremities count of rooms on your server in grafana, or with:

select room_id, count(*) c from event_forward_extremities group by room_id order by c desc limit 20;

Anything above 10-15 is potentially going to affect the performance of your server. Ideally the dummy event sending behaviour would be taking care of these, though we've seen situations before where that can get blocked for some reason.

joepie91 commented 4 years ago

Similar symptoms can often be found in IRC bridged rooms which contain thousands of membership events which often need to be walked back.

I'm confused. Why? I can understand state resolution being necessary when there are multiple participating servers that have a diverging view of the room, but in this case we're talking about a strictly linear log of events where no conflicts should be able to occur, as the homeserver can internally ensure that events are processed sequentially in the correct (ie. inbound) order.

anoadragon453 commented 4 years ago

the homeserver can internally ensure that events are processed sequentially in the correct (ie. inbound) order.

Synapse currently processes up to 5 events simultaneously, so in a busy room it is still possible for state resolution to be necessary.
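
For illustration, here is a minimal sketch of the kind of bounded per-room concurrency being described (purely illustrative: Synapse is built on Twisted and uses its own limiter utility, so none of the names below are Synapse's):

# Illustrative sketch only: allow up to N events per room to be processed
# at once. With N > 1, two local events can both be created against the
# same "current" state, producing branches that later need state resolution.
import asyncio
from collections import defaultdict
from typing import Dict

MAX_CONCURRENT_EVENTS_PER_ROOM = 5  # hypothetical constant for this sketch

_room_limiters: Dict[str, asyncio.Semaphore] = defaultdict(
    lambda: asyncio.Semaphore(MAX_CONCURRENT_EVENTS_PER_ROOM)
)

async def handle_event(room_id: str, event: dict) -> None:
    async with _room_limiters[room_id]:
        # Up to 5 coroutines can be inside this block for the same room,
        # each of which may have read the room state before the others
        # persisted their events.
        await persist_event(room_id, event)

async def persist_event(room_id: str, event: dict) -> None:
    await asyncio.sleep(0)  # stand-in for the real persistence work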

joepie91 commented 4 years ago

So... why is state event handling within a room not sequentialized (eg. via a per-room processing queue) to prevent this situation? That feels like the real issue to me here.

f0x52 commented 4 years ago

As for extremities, it's in the Grafana dashboard at https://stats.pixie.town/d/000000012/synapse?orgId=1&refresh=5m. Checking the query, there's a bunch at 2, 3, 4 and one at 8, but those are all rooms actually shared with other homeservers.

anoadragon453 commented 4 years ago

So... why is state event handling within a room not sequentialized (eg. via a per-room processing queue) to prevent this situation? That feels like the real issue to me here.

I believe that it is currently implemented as a balance between performance and protection against an influx of events. Also sorry, that is 5 events at a time per-room. In the majority of cases it is useful to process more than one event at a time, though this can cause situations in very busy rooms where state resolution becomes necessary.

Checking the query, there's a bunch at 2, 3, 4 and one at 8, but those are all rooms actually shared with other homeservers

Hmm, that's not many, though I wonder if their age comes into play here. Could you pull up the details of each please?

select e.received_ts, e.stream_ordering, e.topological_ordering from events as e, event_forward_extremities as efe where e.room_id = '!xxx:pixie.town' and efe.event_id = e.event_id;

@erikjohnston do you have any other ideas? Not sure if messages are getting blocked behind resolving membership events or something.

joepie91 commented 4 years ago

I believe that it is currently implemented as a balance between performance and protection against an influx of events. Also sorry, that is 5 events at a time per-room. In the majority of cases it is useful to process more than one event at a time, though this can cause situations in very busy rooms where state resolution becomes necessary.

Was there an actual observable performance issue that prompted the introduction of this concurrency? Considering that pretty much every other messaging system processes incoming messages linearly without issues, I'm not sure what Synapse could be doing that would make sequential processing problematic from a performance perspective. And considering that high concurrency is apparently very likely to produce stateres slowdowns, it seems to just be making the performance under load worse.

Does the event processing involve some sort of expensive query to obtain the metadata needed to store the event? If yes, why can Synapse not keep track in-memory of the metadata that is going to be needed to process the next event? Presumably there is some sort of 'running state' that can be kept for this to eliminate roundtrips to the database. Or is the performance cost of an incoming event caused by something else?

Overall, introducing concurrency into message processing within a room is very, very likely to introduce not only performance issues (apparently) but also concurrency-related bugs. I struggle to see a situation in which it would be a net improvement compared to linear processing, and based on your explanation this would seem like a prime suspect for the ongoing performance issues with Synapse.

joepie91 commented 4 years ago

Hmm, that's not many, though I wonder if their age comes into play here. Could you pull up the details of each please?

I should also add that I have repeatedly experienced multiple consecutive lag spikes, with low-latency operation for a few minutes in between, which suggests that it's not just old events being processed.

richvdh commented 4 years ago

@joepie91: I have to say that your tone is somewhat aggressive and not really appreciated. We can try and help you solve your issue but implying that Synapse is maintained by people who don't know what they were doing is more likely just to make us close the issue and ignore you. Please at least stop to consider that there might be good reasons that things are how they are before lashing out.

I'll also take this opportunity to remind you that we welcome Pull Requests. If you are so convinced that you have ways that Synapse should be improved, perhaps you could consider contributing positively instead of merely criticising the efforts of others.

Now, let me try and answer some of your questions:

Was there an actual observable performance issue that prompted the introduction of this concurrency?

It comes naturally from the design of Matrix: it was never "introduced".

If you'd like to experiment with disabling it, you could reduce the 5 in https://github.com/matrix-org/synapse/blob/develop/synapse/handlers/message.py#L392 to 1. If you find it helps, we'd be very happy to consider a PR to make it a configurable option.
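
For readers following along, the experiment amounts to something like the following (paraphrased rather than copied from the linked file, so treat the exact names as approximate):

# Hedged sketch of the suggested experiment: the per-room event-creation
# limiter in synapse/handlers/message.py caps concurrency at 5; the
# experiment lowers that cap to 1 so events in a room are created strictly
# one after another. Identifiers are approximate, not quoted verbatim.
from synapse.util.async_helpers import Linearizer

# current behaviour: up to 5 concurrent event creations per room
limiter = Linearizer(max_count=5, name="room_event_creation_limit")

# experiment: strictly sequential event creation per room
limiter = Linearizer(max_count=1, name="room_event_creation_limit")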

Overall, introducing concurrency into message processing within a room is very, very likely to introduce not only performance issues (apparently) but also concurrency-related bugs.

Given that Matrix is built on the fundamental premise of a decentralised architecture, it needs to handle concurrent messages anyway, so this isn't a terribly helpful suggestion.

Does the event processing involve some sort of expensive query to obtain the metadata needed to store the event? If yes, why can Synapse not keep track in-memory of the metadata that is going to be needed to process the next event? Presumably there is some sort of 'running state' that can be kept for this to eliminate roundtrips to the database. Or is the performance cost of an incoming event caused by something else?

State resolution is a pretty complicated algorithm which can sometimes involve inspecting a significant part of the room history. Its complexity is particularly pronounced in rooms where there are large numbers of membership changes in that history - which is why IRC-bridged rooms can often be heavily affected. So, yes, there is potentially a large amount of data that needs to be read from the database, and indeed Synapse should cache much of it.
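
To give a feel for why this can touch a large part of the history: state resolution (v2) needs the "auth chain difference" between the conflicting state sets, i.e. every event that appears in the full auth chain of some of those sets but not all of them. A toy sketch of that idea (conceptual only, not Synapse's implementation):

# Toy sketch, not Synapse's code: the "auth chain" of an event is everything
# reachable by following auth_events links (create event, power levels, the
# sender's membership, ...). The auth chain difference of several state sets
# is whatever appears in some of their chains but not all of them. In rooms
# with thousands of membership events these chains are long, which is why
# computing the difference can be expensive.
from typing import Dict, List, Set

def auth_chain(event_id: str, auth_events: Dict[str, List[str]]) -> Set[str]:
    """All ancestors of event_id reachable via auth_events edges."""
    seen: Set[str] = set()
    stack = list(auth_events.get(event_id, []))
    while stack:
        eid = stack.pop()
        if eid not in seen:
            seen.add(eid)
            stack.extend(auth_events.get(eid, []))
    return seen

def auth_chain_difference(
    state_sets: List[Set[str]], auth_events: Dict[str, List[str]]
) -> Set[str]:
    """Events in at least one state set's full auth chain, but not in all.

    Assumes at least one state set is given.
    """
    chains = [
        set().union(*(auth_chain(e, auth_events) for e in s))
        for s in state_sets
    ]
    return set.union(*chains) - set.intersection(*chains)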

The fact that you're seeing such high database usage may suggest that certain caches are undersized for your usage pattern. Unfortunately, tuning Synapse's caches can be a bit of a black art currently (it's an area we hope to improve in the future). A good starting point is checking the "Cache misses" prometheus metrics. https://github.com/matrix-org/synapse/wiki/Understanding-Synapse-Performance-Issues-Through-Grafana-Graphs#caches has some notes on this.

joepie91 commented 4 years ago

I have to say that your tone is somewhat aggressive and not really appreciated. We can try and help you solve your issue but implying that Synapse is maintained by people who don't know what they were doing is more likely just to make us close the issue and ignore you. Please at least stop to consider that there might be good reasons that things are how they are before lashing out.

I can understand that I'm coming across like that. But please also understand that these performance issues have been ongoing for a long, long time, they are not obviously explainable, and it's incredibly frustrating to be faced on an almost-daily basis with unexpected 5-10 minute delays on messages in the middle of a conversation or technical explanation, essentially making any real conversation impossible to have, when no other systems I have ever used (including decentralized and distributed ones) have ever had lag spikes this bad.

I am not implying that the maintainers of Synapse don't know what they're doing, but I am of the opinion that there is a serious problem here that looks like a fundamental architectural issue somewhere, one that is not getting addressed well enough to actually make Matrix something I can recommend to others. I want to see Matrix succeed just like everyone else, and having the sole feature-complete homeserver implementation suffer performance issues this bad is a real problem not just for me personally but also for Matrix as an ecosystem.

I've been digging into the protocol quite a lot over the past few months, both to work on my own client and server implementations, and to try and better understand where any inherent performance issues might exist in the protocol. I haven't found any yet. That leaves me with the conclusion that it appears to be a problem specific to Synapse's architecture.

That is what I'm trying to narrow down here and why I'm asking questions about why things work how they do, and yes, after months of significantly impaired communication, I have gotten a little touchy about it, especially when I get responses along the lines of "oh well dunno what causes it :shrug:" like I've seen in the past.

I'll also take this opportunity to remind you that we welcome Pull Requests. If you are so convinced that you have ways that Synapse should be improved, perhaps you could consider contributing positively instead of merely criticising the efforts of others.

I find it a little disingenuous to suggest that I am not "contributing positively". I've filed a considerable number of detailed issues across the various Matrix projects tracking down spec issues, non-compliance, and so on. I've had quite a few discussions trying to better understand the architecture of things, and suggesting alternative approaches. I've been working on different database designs that reduce the cost of state resolution as much as possible. I've already offered in the past to help out on significant refactoring of the JS SDK.

Just because I'm frustrated about a long-standing issue and haven't made a PR to fix that specific issue (I can't even tell where exactly the issue is yet!) doesn't mean that I'm just here to "criticize the efforts of others". Please consider that we're working towards the same goal here, and that I genuinely feel that there is a significant architectural problem somewhere that I am trying to track down, with my still-limited understanding of the Synapse codebase.

It comes naturally from the design of Matrix: it was never "introduced".

Matrix as a protocol might be designed as a decentralized database, but the same does not apply for the message processing pipeline of a homeserver implementation, which is presumably going to have a single in-memory model of a room, an event stream, and so on. Considering that messages are processed with a concurrency of specifically 5, that leads me to assume that the concurrent handling was an intentional design decision, as opposed to the baseline situation. It would have been different had the concurrency been unbounded.

If you'd like to experiment with disabling it, you could reduce the 5 in https://github.com/matrix-org/synapse/blob/develop/synapse/handlers/message.py#L392 to 1. If you find it helps, we'd be very happy to consider a PR to make it a configurable option.

I've asked @f0x52 to make that change, and will report back with the results in a few days, probably.

Given that Matrix is built on the fundamental premise of a decentralised architecture, it needs to handle concurrent messages anyway, so this isn't a terribly helpful suggestion.

This is only partially true. Notably, for private bridges, it is very unlikely that interaction with other homeservers ever needs to take place; and so it makes sense to design the process such that the local 'baseline behaviour' is that of a centralized system, so long as no other homeservers are involved. Especially with IRC bridges, which can produce a lot of events in a single channel, as you mention.

Likewise, even in a situation where other homeservers are involved, I suspect a lot of unnecessary stateres can be avoided by treating all local messages as 'trusted' from a DAG perspective (which they are) and handling them sequentially; only merging in external event feeds from other homeservers as activity from them actually occurs.

All in all, it just doesn't make a lot of sense to throw local events into the same ":shrug: decentralized events" bucket as externally-received events, which is what I understand to be how it works now. Why not prevent the need for stateres where possible, considering it is apparently such an intensive process?

The fact that you're seeing such high database usage may suggest that certain caches are undersized for your usage pattern. Unfortunately, tuning Synapse's caches can be a bit of a black art currently (it's an area we hope to improve in the future). A good starting point is checking the "Cache misses" prometheus metrics. https://github.com/matrix-org/synapse/wiki/Understanding-Synapse-Performance-Issues-Through-Grafana-Graphs#caches has some notes on this.

The caches on pixie.town have been adjusted many times in the past, as a result of this same advice, and this has not led to any lasting performance improvements. The pixie.town Grafana is also publicly accessible, if that's of any use.

richvdh commented 4 years ago

I don't feel like it will be fruitful to continue this discussion, so I will simply say: I appreciate your frustration, but please remember we are doing our best, with limited time, to support the ecosystem. We cannot do everything, and we have to balance time spent supporting individual synapse admins with trying to develop Synapse. In short, please be patient with those who seek to help you.

Considering that messages are processed with a concurrency of specifically 5, that leads me to assume that the concurrent handling was an intentional design decision, as opposed to the baseline situation. It would have been different had the concurrency been unbounded.

I may be misunderstanding your point, but to be clear: the concurrency limit was added in https://github.com/matrix-org/synapse/pull/1620; before that, any number of events would be handled concurrently.

The caches on pixie.town have been adjusted many times in the past, as a result of this same advice, and this has not led to any lasting performance improvements

As I said, the fact that caches in synapse need manual tuning is a thing that we are very aware of, and hope to improve in future. It's sadly not the case that, once tuned, you can call the job done. https://github.com/matrix-org/synapse/issues/3876 is the umbrella bug here.

The pixie.town Grafana is also publicly accessible, if that's of any use.

Well, one thing it tells me is that it looks like the cache for get_users_in_room is undersized:

[grafana screenshots: get_users_in_room cache hits vs. misses]

That probably isn't the direct cause of the lag you mention in this case, but it could be contributing to general poor performance (which in turn makes it more likely that events get created concurrently, leading to more complex state res down the line), as well as leaving less headroom for big queries.

Correlating the spikes in event send time with the database transaction graph...

[grafana screenshots: event send time spikes correlated with database transaction time]

... we can see that the problem is get_auth_chain_difference. I'm afraid there's not a lot I can say here, other than "yes, that's a (potentially expensive) query that happens during state resolution." There may be scope to improve its performance: I've opened https://github.com/matrix-org/synapse/issues/8622 to track that.

joepie91 commented 4 years ago

I don't feel like it will be fruitful to continue this discussion, so I will simply say: I appreciate your frustration, but please remember we are doing our best, with limited time, to support the ecosystem. We cannot do everything, and we have to balance time spent supporting individual synapse admins with trying to develop Synapse. In short, please be patient with those who seek to help you.

I understand that. I'm definitely not asking for dedicated support here; I'm just trying to understand how the architecture ties together, and trying to find some kind of fundamental issue that explains the performance discrepancy with other software, because the numbers just don't add up to me. Unfortunately the Synapse architecture is still largely a mystery to me (and I've not found it particularly easy to understand it, with the limited time / attention span I'm able to dedicate to it), hence my questions.

For what it's worth, it provisionally seems like reducing the concurrency to 1 has significantly stabilized the performance of pixie.town (without introducing an observable performance hit), but I will be keeping an eye on it for a week or so to ensure that this is not a fluke. I have not experienced any send lag spikes whatsoever since the change.

If this result persists, then I suspect - from my current understanding of how Synapse works - that significant performance gains could be had from just processing everything sequentially (per-room) and keeping a 'running tally' of certain expensive room state in memory, separately from other caches, removing the need for a lot of database roundtrips and cache tuning - which would not be possible as such with concurrent processing, because then there is not one singular "current room state", AIUI.
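
To make that concrete, here is a rough sketch of the kind of per-room sequential pipeline I have in mind (an illustration of the idea only, not a claim about how Synapse is or should be structured):

# Rough sketch of the proposal (illustration only): one worker per room
# drains a queue sequentially and applies each event to a single in-memory
# "rolling" copy of the room state, so no two local events are ever created
# from diverging views of the room and no state resolution is needed for
# purely local traffic.
import asyncio
from typing import Dict, Tuple

class RoomWorker:
    def __init__(self, room_id: str) -> None:
        self.room_id = room_id
        self.queue: "asyncio.Queue[dict]" = asyncio.Queue()
        # (event type, state key) -> state event, kept up to date in memory
        self.current_state: Dict[Tuple[str, str], dict] = {}

    async def run(self) -> None:
        while True:
            event = await self.queue.get()
            # Events are applied one at a time, so this is always the single
            # authoritative "current" state for the room.
            if event.get("state_key") is not None:
                self.current_state[(event["type"], event["state_key"])] = event
            await self.persist(event)
            self.queue.task_done()

    async def persist(self, event: dict) -> None:
        await asyncio.sleep(0)  # stand-in for writing the event to the database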

I may be misunderstanding your point, but to be clear: the concurrency limit was added in #1620; before that, any number of events would be handled concurrently.

Right, so if I understand your answer correctly, this indirectly answers my question, in that fully-sequential operation was never attempted, and Synapse has always operated in some sort of concurrent mode, whether unbounded or not. Is that correct?

(I'm asking this mainly in case what I'm suggesting has been tried before and found not to work, for some reason I am not aware of.)

As I said, the fact that caches in synapse need manual tuning is a thing that we are very aware of, and hope to improve in future. It's sadly not the case that, once tuned, you can call the job done. #3876 is the umbrella bug here.

That makes sense in principle, but "it starts failing again after a few days" seems like an exceptionally short time for cache parameter changes to become ineffective, on a homeserver with a small and fixed set of users, and seems to suggest something else going on - perhaps it only reduces the load just enough to avoid cascading slowness as a result of a slow query, and a few days later the load has risen slightly again?

I don't think we've ever seen any cache-related improvement last for more than a few days, though perhaps @f0x52 can weigh in on that.

... we can see that the problem is get_auth_chain_difference. I'm afraid there's not a lot I can say here, other than "yes, that's a (potentially expensive) query that happens during state resolution." There may be scope to improve its performance: I've opened #8622 to track that.

Thanks, I've subscribed to the issue.

joepie91 commented 4 years ago

So, an update on the max_count=1 experiment. Findings so far:

Some graphs are provided below. The dashed blue line indicates the moment (21 oct, 20:00) that the experiment started, ie. the event processing concurrency changed to 1.

[grafana screenshots: metrics from before and after the concurrency change]

Finally, a comparison of the "Events Persisted" graphs before vs. after the experiment started, to illustrate how the jitter spikes have been reduced:

[grafana screenshots: "Events Persisted", before vs. after the change]

In conclusion: Having the concurrency be configurable would be a great thing, as it has clearly helped here.

I suspect it would also be reasonable to just handle events sequentially by default for everything, which will likely also allow other optimizations because the event processing code can then operate on a 'rolling state' that's kept explicitly in-memory, rather than needing to make roundtrips to the database. However, this experiment has a sample size of 1, and this should probably be replicated with other Synapse deployments first.


Unrelatedly, @f0x52 discovered that, due to a configuration problem, the cache factor had silently failed to take effect. After fixing this, the fix was temporarily rolled back so as to not affect the results of the experiment too much. Interestingly, changing the cache factor actually subjectively appeared to result in worse performance, even though CPU usage went down significantly. But that's perhaps a concern for another issue.

If Synapse is not already doing so, it would be useful to print out such configuration on startup, so that the HS admin can verify that their changes have correctly taken effect.

joepie91 commented 4 years ago

Addendum, with a sample size of 1: Synapse now also seems to recover more quickly and gracefully from significant backlogs.

The pixie.town Synapse broke overnight due to #8574, and was restored about 9 hours later. Normally this results in easily 30+ minutes of a nearly-inoperable Synapse (or sometimes hours!) as it attempts to catch up, but this time it was only inoperable for about 5 minutes, after which normal operation resumed with only some (quite acceptable) latency, while it continued catching up in the background.

This is very much not certain, since I only have a single such incident to draw from and no comparative metrics, but I suspect that the concurrent backlog processing would normally result in a lot of unnecessary branching due to the sudden firehose of inbound events, making catch-up take much longer - and that was now avoided by processing it sequentially.

Edit: Sample size of 2 now, Synapse failed again for the same reason last night, this time it took about 7-8 minutes to recover.

joepie91 commented 3 years ago

Update: performance has regressed severely since Synapse 1.23.0, even with the concurrency fix enabled, which previously solved the problem. Logs show IRC rooms consistently being the cause of slow stateres, which should not be possible considering that they only contain local events.

Enabling or disabling the higher cache factor or jemalloc does not seem to have any effect. It's still frequently getting lagspikes of minutes. Below is a graph for one such lagspike today, with a crazy high amount of cache misses for event_auth_cache events:

[grafana screenshot: spike in event_auth_cache cache misses during the lag spike]

Even with a limited cache factor, I have difficulty seeing a good reason for tens to hundreds of thousands(!) of such queries, for what seems like the stateres of a single IRC room (#Node.js) which should only have local events to begin with. It really feels like there's a bug here somewhere, separate from #8622.

callahad commented 3 years ago

@joepie91 I'd be curious to know if #8971 (in 1.25.0rc1) meaningfully improves things on your end

joepie91 commented 3 years ago

@callahad Reporting back as promised: yes, this seems to have resolved the lagspikes.

This raises the question, however: why did it? As I understand it, that change should have only affected rooms with frequent powerlevel events (ref. an earlier discussion in #synapse-dev), but the rooms which Synapse frequently blames for slow stateres are not rooms with frequent powerlevel events.

I will have to look into this more. It's possible that this is hiding a deeper bug.

callahad commented 3 years ago

To try and sum up:

On that basis, I'm going to close this and consider it resolved by the fix in 1.25.0.

The main caveat is that, per @joepie91's understanding, it is not obvious that the fix should have affected the rooms at issue. However, the titular issue — that state res in local rooms is slow — seems to have been resolved.

@joepie91 If you find we are indeed doing state resolution in places where we shouldn't, I'd appreciate it if you could file a separate bug so we can track that specifically. Thank you for your patience and willingness to dig into this (and to try the rc!)