How to detect the main source of permanently high CPU usage in Synapse via Prometheus stats?

MurzNN commented 4 years ago

Our public ru-matrix.org homeserver, with about 20 active users, has very low traffic (10-100 new outgoing messages per day), but the Synapse process permanently uses 100% CPU!

The server is not underpowered: it has 4 vCPUs and 16 GB of RAM with 6-8 GB free (cached), and the Synapse cache factor is 4.0.

We have Prometheus metrics and the Grafana "Synapse Default" dashboard with many cool charts. But looking at them, I can't work out what to do next to find the source of such high CPU usage.

Here is a screenshot of the CPU usage chart:

And another interesting chart, which I don't understand:

I already tried asking for support in #synapse:matrix.org but got no help, so I'm filing a permanent support issue that other Synapse users will also be able to find via Google.

Can you please describe here how to analyse the main sources of high CPU usage by the Synapse process using the many Grafana charts based on Synapse's Prometheus metrics, or link to relevant articles? I tried to find articles about this but found nothing. Thanks!

MurzNN commented 4 years ago

Maybe the problem is related to the huge size of the state_groups_state table (274 million rows); there is an issue about this here: https://github.com/matrix-org/synapse/issues/3364, but I can't tell whether this is the cause or not.

anoadragon453 commented 4 years ago

I've written up a basic guide to figuring out why your Synapse instance may be running slow by looking at Grafana graphs: https://github.com/matrix-org/synapse/wiki/Understanding-Synapse-Performance-Issues-Through-Grafana-Graphs

I think that would probably be a better place to consolidate information on the topic as it is community-maintainable, and one will not have to parse many comments on a github issue.
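As a rough complement to reading the graphs, the same Prometheus data can be queried directly and ranked. The sketch below is only an illustration: it assumes Prometheus is reachable over its HTTP API, and the metric name in DEFAULT_QUERY is an assumption that may differ between Synapse and client-library versions (check your own /metrics output). The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# ASSUMPTION: this metric name and its "name" label may differ on your
# Synapse version (newer client libraries may append "_total").
DEFAULT_QUERY = (
    "sum by (name) (rate(synapse_background_process_ru_utime_seconds[5m]))"
)


def build_query_url(prometheus_base, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return prometheus_base.rstrip("/") + "/api/v1/query?" + params


def rank_series(response, label, top=5):
    """Return (label value, sample value) pairs, highest value first."""
    rows = []
    for series in response["data"]["result"]:
        # Each instant-vector sample is [timestamp, "value-as-string"].
        value = float(series["value"][1])
        rows.append((series["metric"].get(label, "<none>"), value))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]


def fetch(prometheus_base, promql=DEFAULT_QUERY):
    """Run the query against a live Prometheus server (needs network access)."""
    url = build_query_url(prometheus_base, promql)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Something like `rank_series(fetch("http://localhost:9090"), "name")` would then list the background processes using the most CPU time per second, assuming that address and metric name match your setup.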

Maybe the problem is related to the huge size of the state_groups_state table (274 million rows); there is an issue about this here: #7520, but I can't tell whether this is the cause or not.

I think you meant to link a different issue? As for state_groups_state, the size of that table can be reduced with this tool: https://github.com/matrix-org/rust-synapse-compress-state
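Before compressing state, it can help to know which rooms account for most of the rows. The following is a minimal sketch, assuming the table has a room_id column; it runs against an in-memory SQLite stand-in rather than a real Synapse database, and the helper names are hypothetical. On a table with hundreds of millions of rows the same GROUP BY can take a long time.

```python
import sqlite3

# Counts state_groups_state rows per room, largest first.
# With psycopg2 against PostgreSQL the placeholder would be %s instead of ?.
LARGEST_ROOMS_SQL = """
SELECT room_id, COUNT(*) AS num_rows
FROM state_groups_state
GROUP BY room_id
ORDER BY num_rows DESC
LIMIT ?
"""


def largest_rooms(conn, limit=10):
    """Return (room_id, row count) pairs for the rooms with the most rows."""
    return conn.execute(LARGEST_ROOMS_SQL, (limit,)).fetchall()


def demo_connection():
    """In-memory stand-in with only the columns the query needs.

    This is NOT the real Synapse schema, just enough to run the query.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE state_groups_state (state_group INTEGER, room_id TEXT)"
    )
    rows = [(1, "!big:example.org")] * 3 + [(2, "!small:example.org")]
    conn.executemany("INSERT INTO state_groups_state VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    for room_id, num_rows in largest_rooms(demo_connection()):
        print(room_id, num_rows)
```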

MurzNN commented 4 years ago

I think you meant to link a different issue?

Yes, I've fixed the link to point to the right issue.

I've written up a basic guide to figuring out why your Synapse instance may be running slow by looking at Grafana graphs: https://github.com/matrix-org/synapse/wiki/Understanding-Synapse-Performance-Issues-Through-Grafana-Graphs

Thanks, that's exactly what I need for understanding the Grafana charts built from Synapse stats!

It would also be good to describe the background jobs. For example, I see that synapse-index persist_events has very high CPU usage; what does this mean, and what is the source of this high usage? What do PDU and EDU mean in "Incoming PDU/EDU rate"? Etc.

Here is a group of charts that are not so easy to understand:

Also, in https://github.com/matrix-org/synapse/blob/master/contrib/grafana/synapse.json I don't see the Transaction Count and Transaction Duration charts; do we need to create them manually?

anoadragon453 commented 4 years ago

It would also be good to describe the background jobs. For example, I see that synapse-index persist_events has very high CPU usage; what does this mean, and what is the source of this high usage? What do PDU and EDU mean in "Incoming PDU/EDU rate"? Etc.

persist_events is a database transaction that... persists events to the database. Most likely it runs when receiving federation traffic. You'll soon be able to move this task off the master process, which should help performance here, though you will need to run in worker mode to benefit.

PDU = Persistent Data Unit (state events, messages, stuff that sticks around)
EDU = Ephemeral Data Unit (read receipts, typing notifications)
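To make the distinction concrete, here is a small sketch that counts PDUs and EDUs in a federation transaction body. It assumes the body carries them in "pdus" and "edus" arrays and that each EDU has an "edu_type" field, as in the Matrix federation API; the function name and sample data are made up for illustration and are not taken from Synapse.

```python
from collections import Counter


def summarize_transaction(txn):
    """Count PDUs and EDUs (the latter also by edu_type) in a transaction body."""
    pdus = txn.get("pdus", [])
    edus = txn.get("edus", [])
    # EDUs are grouped by their edu_type, e.g. m.typing or m.receipt.
    edu_types = Counter(edu.get("edu_type", "<unknown>") for edu in edus)
    return {
        "pdus": len(pdus),
        "edus": len(edus),
        "edus_by_type": dict(edu_types),
    }


if __name__ == "__main__":
    # Hypothetical, heavily trimmed transaction body.
    sample = {
        "origin": "example.org",
        "pdus": [{"type": "m.room.message"}],
        "edus": [
            {"edu_type": "m.typing", "content": {}},
            {"edu_type": "m.receipt", "content": {}},
            {"edu_type": "m.typing", "content": {}},
        ],
    }
    print(summarize_transaction(sample))
```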

Also, in https://github.com/matrix-org/synapse/blob/master/contrib/grafana/synapse.json I don't see the Transaction Count and Transaction Duration charts; do we need to create them manually?

Yes, apologies, I actually used some internal graphs and need to update the wiki with publicly available ones (or update the community version).

MurzNN commented 4 years ago

@anoadragon453 thanks for the description. I added short info about the Federation section to your wiki page here: https://github.com/matrix-org/synapse/wiki/Understanding-Synapse-Performance-Issues-Through-Grafana-Graphs#federation - please correct me if I wrote something wrong.

Also, can Prometheus split the persist_events info by event type? On my homeserver this process eats most of the CPU, and I'm trying to understand the source of it. Does persist_events include EDUs, or only PDUs?

anoadragon453 commented 4 years ago

@MurzNN great addition, thank you!

Also, can Prometheus split the persist_events info by event type?

Not yet, it seems. The tracker for outgoing EDUs is defined here and is split by type: https://github.com/matrix-org/synapse/blob/075375bbc97f16c5750c446534342b3a63d9be5a/synapse/federation/sender/per_destination_queue.py#L49-L53

Whereas the tracker for incoming EDUs is defined here and is a single total: https://github.com/matrix-org/synapse/blob/d78cb31588e01468ab06a36e6120a80fb6fbf097/synapse/federation/federation_server.py#L70

Naively I don't think there's a technical reason for why we can't do the same thing for incoming EDUs, and I think it would be useful for sysadmins. Could you make an issue for that?

Does persist_events include EDUs, or only PDUs?

PDUs only.

anoadragon453 commented 4 years ago

Going to close this for now as we now have the wiki page to collect notes.

Follow-up questions should be directed to #synapse:matrix.org (and hopefully they will be answered eventually).

MurzNN commented 4 years ago

Naively I don't think there's a technical reason for why we can't do the same thing for incoming EDUs, and I think it would be useful for sysadmins. Could you make an issue for that?

I filed an issue here: https://github.com/matrix-org/synapse/issues/7666

PDUs only.

I added this to the wiki: "persist_events is a transaction that saves new PDU (Persistent Data Unit) events to the Synapse database; EDUs (Ephemeral Data Units) are not counted in this group."

matrix-org / synapse

How to detect the main source of permanently high CPU usage in Synapse via Prometheus stats? #7520