matrix-org / sliding-sync

Proxy implementation of MSC3575's sync protocol.
https://github.com/matrix-org/matrix-spec-proposals/pull/3575
Apache License 2.0

Initial sync uses a ton of memory on Synapse's synchrotron #258

Open turt2live opened 1 year ago

turt2live commented 1 year ago

I set up a brand new sliding sync proxy to test Element Android X. When it started the poller for the first time, it slowly consumed all 7.5 GB of memory I was able to give the Synapse synchrotron worker, eventually causing OOM issues.

For comparison, an initial sync for my account on Element Desktop only uses 2-3 GB of memory on Synapse's synchrotron.

I suspect this is related to the use of an inline filter on the initial sync, but haven't confirmed. Using a pre-uploaded filter might yield better results, as I believe Synapse does the filtering in-database when using a pre-uploaded filter.
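To make the two options concrete, here is a minimal sketch of the request shapes involved. It assumes the standard Matrix client-server filter API, where a filter is uploaded with a POST to the user's `/filter` endpoint and the returned `filter_id` is then passed as the `filter` query parameter. The helper names are hypothetical and are not taken from the proxy's code; nothing here contacts a server.

```python
import json
from urllib.parse import quote, urlencode

# Path as it appears in the Synapse access log later in this thread.
SYNC_PATH = "/_matrix/client/r0/sync"


def inline_filter_sync_url(filter_dict, timeout_ms=0):
    """Sync URL that carries the whole filter as URL-encoded JSON."""
    filter_json = json.dumps(filter_dict, separators=(",", ":"))
    return SYNC_PATH + "?" + urlencode({"timeout": timeout_ms, "filter": filter_json})


def filter_upload_path(user_id):
    """Path to POST a filter to; the response body contains a filter_id."""
    return "/_matrix/client/r0/user/" + quote(user_id, safe="") + "/filter"


def uploaded_filter_sync_url(filter_id, timeout_ms=0):
    """Sync URL that references a previously uploaded filter by its ID."""
    return SYNC_PATH + "?" + urlencode({"timeout": timeout_ms, "filter": filter_id})


if __name__ == "__main__":
    print(inline_filter_sync_url({"room": {"timeline": {"limit": 1}}}))
    print(filter_upload_path("@travis:t2l.io"))
    print(uploaded_filter_sync_url("abc123"))  # "abc123" is a made-up filter ID
```

Either way the server ends up with the same filter definition; the only difference on the wire is whether the JSON travels with every sync request or is referenced by ID.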

turt2live commented 1 year ago

[synchrotron_1] 2023-08-17 21:50:00,271 - synapse.access.http.8050 - 465 - INFO - GET-1- 172.18.0.3 - 8050 - {@travis:t2l.io} Processed request: 646.819sec/-345.729sec (269.006sec, 42.309sec) (133.925sec/1309.064sec/34008) 0B 200! "GET /_matrix/client/r0/sync?timeout=0&filter=%7B%22room%22%3A%7B%22timeline%22%3A%7B%22limit%22%3A1%7D%7D%7D HTTP/1.0" "sync-v3-proxy-0.99.4" [1093305 dbevts]

Final tally for memory usage was 11 GB.
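The `filter` parameter in the logged request is URL-encoded JSON. A small sketch for decoding it, using only the Python standard library (the function name is hypothetical; the request target is copied from the log line above):

```python
import json
from urllib.parse import parse_qs, urlsplit

# Request target copied verbatim from the access log entry.
LOGGED_TARGET = (
    "/_matrix/client/r0/sync?timeout=0"
    "&filter=%7B%22room%22%3A%7B%22timeline%22%3A%7B%22limit%22%3A1%7D%7D%7D"
)


def decode_sync_filter(target):
    """Extract and parse the inline JSON filter from a /sync request target."""
    query = parse_qs(urlsplit(target).query)
    return json.loads(query["filter"][0])


sync_filter = decode_sync_filter(LOGGED_TARGET)
print(sync_filter)  # {'room': {'timeline': {'limit': 1}}}
```

The decoded filter limits each room's timeline to one event and specifies nothing about room state or members.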

DMRobertson commented 1 year ago

Sounds like a Synapse bug to me.

https://github.com/matrix-org/sliding-sync/blob/3bf3f2305373d69d6346d38b70be8add2a53f029/sync2/client.go#L97-L132 is the logic for generating a sync URL.

We use a filter for two purposes:

DMRobertson commented 1 year ago

> Using a pre-uploaded filter might yield better results, as I believe Synapse does the filtering in-database when using a pre-uploaded filter.

I can't see any difference in the filtering logic for these two cases: https://github.com/matrix-org/synapse/blob/54317d34b76adb1e8f694acd91f631b3abe38947/synapse/rest/client/sync.py#L166-L187

turt2live commented 1 year ago

From the sliding sync internal room, a realization: the filter sliding sync uses does not lazy-load room members, whereas Element Desktop's does. This almost certainly explains the 11 GB of memory required to process the initial sync.

If it's not strictly required to have all the member events, I'd suggest the proxy aggressively lazy load members.
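For illustration, this is roughly what opting in to lazy loading would look like at the filter level. The sketch assumes the `lazy_load_members` field of the room state filter from the Matrix client-server spec; the helper name is hypothetical, and this is not something the proxy does.

```python
import copy

# The filter the proxy sends, as seen URL-encoded in the access log above.
PROXY_FILTER = {"room": {"timeline": {"limit": 1}}}


def with_lazy_loaded_members(sync_filter):
    """Return a copy of sync_filter that asks the server to lazy-load members.

    Assumption: setting room.state.lazy_load_members is sufficient, which is
    how the spec describes the option; the input filter is left unmodified.
    """
    result = copy.deepcopy(sync_filter)
    room = result.setdefault("room", {})
    room.setdefault("state", {})["lazy_load_members"] = True
    return result


if __name__ == "__main__":
    print(with_lazy_loaded_members(PROXY_FILTER))
```

With this option the server only returns membership events for users relevant to the returned timeline events, which is why it reduces the size of an initial sync for rooms with many members.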

kegsay commented 1 year ago

The proxy needs the member events at every event in order to calculate history visibility locally. For example, consider:

The proxy was not designed to handle partial room state, and adding that in would be a significant, risky and costly change.
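To show why membership at each event matters, here is a deliberately simplified sketch of a local history visibility check. It is not the proxy's implementation: it assumes a single `history_visibility` setting for the whole timeline, ignores edge cases in the spec, and uses hypothetical function names. The point is only that the result depends on knowing the user's membership at the moment each event was sent.

```python
def is_member_event_for(event, user_id):
    """True if event is an m.room.member state event about user_id."""
    return event.get("type") == "m.room.member" and event.get("state_key") == user_id


def visible_events(timeline, user_id, history_visibility):
    """Return the events (oldest first) that user_id may see.

    Simplified rules, assumed for illustration:
      world_readable: everything is visible
      shared:         everything is visible if the user ever joined
      invited:        visible while the user is invited or joined
      joined:         visible only while the user is joined
    """
    ever_joined = any(
        is_member_event_for(e, user_id) and e["content"]["membership"] == "join"
        for e in timeline
    )
    membership = "leave"  # assume the user starts outside the room
    visible = []
    for event in timeline:
        if is_member_event_for(event, user_id):
            membership = event["content"]["membership"]
        if history_visibility == "world_readable":
            allowed = True
        elif history_visibility == "shared":
            allowed = ever_joined
        elif history_visibility == "invited":
            allowed = membership in ("invite", "join")
        elif history_visibility == "joined":
            allowed = membership == "join"
        else:
            raise ValueError("unknown history_visibility: " + history_visibility)
        if allowed:
            visible.append(event)
    return visible


def _msg(event_id):
    return {"event_id": event_id, "type": "m.room.message", "content": {}}


def _member(event_id, user_id, membership):
    return {
        "event_id": event_id,
        "type": "m.room.member",
        "state_key": user_id,
        "content": {"membership": membership},
    }


EXAMPLE_TIMELINE = [
    _msg("$m1"),
    _member("$invite", "@alice:example.org", "invite"),
    _msg("$m2"),
    _member("$join", "@alice:example.org", "join"),
    _msg("$m3"),
    _member("$leave", "@alice:example.org", "leave"),
    _msg("$m4"),
]
```

If the member events were lazy-loaded and some were missing, the `membership` value tracked in the loop could be wrong for an unknown number of events, and the check would silently over- or under-serve history.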

kegsay commented 6 months ago

The scenario above is mitigated somewhat because of the cache invalidation work, coupled with https://github.com/matrix-org/sliding-sync/pull/366 - the proxy tries really hard NOT to do history visibility checks so it will cut off serving events up to the user's join event.
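One reading of "cut off serving events up to the user's join event" can be sketched as follows. This is an illustration only, not the code from the linked PR: the function name is hypothetical, and the sketch does not handle a user who leaves again after joining.

```python
def events_since_join(timeline, user_id):
    """Return the suffix of timeline (oldest first) starting at the user's
    most recent transition into the 'join' membership.

    A join event that follows another join (for example a display name
    change) is not treated as a new join. Returns [] if the user never joined.
    """
    membership = "leave"  # assume the user starts outside the room
    join_index = None
    for index, event in enumerate(timeline):
        if event.get("type") == "m.room.member" and event.get("state_key") == user_id:
            new_membership = event["content"]["membership"]
            if new_membership == "join" and membership != "join":
                join_index = index
            membership = new_membership
    return [] if join_index is None else timeline[join_index:]


def _event(event_id, event_type, state_key=None, membership=None):
    event = {"event_id": event_id, "type": event_type, "content": {}}
    if state_key is not None:
        event["state_key"] = state_key
    if membership is not None:
        event["content"]["membership"] = membership
    return event


EXAMPLE = [
    _event("$m1", "m.room.message"),
    _event("$join", "m.room.member", "@alice:example.org", "join"),
    _event("$m2", "m.room.message"),
    _event("$rename", "m.room.member", "@alice:example.org", "join"),
    _event("$m3", "m.room.message"),
]
```

Serving only this suffix avoids a per-event visibility check, because everything in it was sent while the user was already in the room.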

There are still numerous pitfalls:

We ultimately need the entire member list. Synapse ideally should stream the list back if it's too large.

kegsay commented 5 months ago

To further emphasise why we cannot use Synapse lazy loading: it's not even accurate. See https://github.com/element-hq/synapse/issues/17050 and related issues.