matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Federation readers regularly get stuck at 100% CPU #10698

Closed: turt2live closed this issue 3 years ago

turt2live commented 3 years ago

Description

Federation reader workers regularly get stuck at 100% CPU for no explicable reason (the graphs don't reveal why the CPU is spinning). Investigation is needed into what the workers are doing, and why this is happening.

The readers being down generally doesn't affect the local server itself, but it does affect every other server's ability to interact with it: joining rooms, looking up profiles, inviting other members, etc. The failure ends up making the other servers look bad when the cause isn't their fault :(

Steps to reproduce

Unclear

Version information

erikjohnston commented 3 years ago

It seems completely wedged merging dicts. Python stack trace from GDB:

Traceback (most recent call first):
  File "/home/synapse/src/synapse/storage/databases/main/events_worker.py", line 583, in _get_events_from_cache_or_db
    event_entry_map.update(result)
  File "/home/synapse/src/synapse/storage/databases/main/events_worker.py", line 383, in get_events_as_list
    set(event_ids), allow_rejected=allow_rejected
  File "/home/synapse/src/synapse/storage/databases/main/event_federation.py", line 93, in get_auth_chain
    return await self.get_events_as_list(event_ids)
  File "/home/synapse/src/synapse/federation/federation_server.py", line 604, in on_send_join_request
    auth_chain = await self.store.get_auth_chain(room_id, state_ids)
  File "/home/synapse/src/synapse/federation/transport/server/federation.py", line 358, in on_PUT
    result = await self.handler.on_send_join_request(origin, content, room_id)

C stack trace:

#0  dk_get_index (i=<optimized out>, keys=<optimized out>) at Objects/dictobject.c:842
#1  lookdict_unicode_nodummy (value_addr=0x7fff0ad2f408, hash=7255247804242703278, key=0x7fc776e99220, mp=0x7fc7d428d048) at Objects/dictobject.c:845
#2  lookdict_unicode_nodummy (mp=<optimized out>, key=<optimized out>, hash=<optimized out>, value_addr=<optimized out>, mp=<optimized out>, key=<optimized out>, hash=<optimized out>, 
    value_addr=<optimized out>) at Objects/dictobject.c:826
#3  0x000055c499a283ce in insertdict (mp=0x7fc7d428d048, key=0x7fc776e99220, hash=7255247804242703278, value=0x7fc2e2e13e98) at Objects/dictobject.c:1015
#4  0x000055c499af308f in dict_merge (a=0x7fc7d428d048, b=0x7fc3a57bf898, override=1) at Objects/dictobject.c:2428
#5  0x000055c499af327b in PyDict_Merge (override=1, b=<optimized out>, a=0x7fc7d428d048) at Objects/dictobject.c:2513
#6  dict_update_common (methname=0x55c499bcfb10 "update", kwds=0x0, args=<optimized out>, self=0x7fc7d428d048) at Objects/dictobject.c:2250
#7  dict_update (self=0x7fc7d428d048, args=<optimized out>, kwds=0x0) at Objects/dictobject.c:2272
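
For context, the wedged Python frame is the event_entry_map.update(result) call, and the C frames show insertdict/lookdict_unicode_nodummy probing the target dict's hash table without terminating. A probe loop that never ends usually means the dict's internal invariants were broken, which in practice points at unsynchronised mutation, e.g. the same dict being touched from more than one thread. A minimal sketch of the defensive pattern, under the (unverified) assumption that such a map is reachable from both the reactor and a DB thread; the names here are illustrative, not Synapse's actual code:

import threading

_map_lock = threading.Lock()
event_entry_map: dict = {}  # hypothetical map shared between threads

def merge_results(result: dict) -> None:
    # Serialise all writers so only one thread mutates the map at a
    # time; an unsynchronised resize can leave the table in a state
    # that later lookups probe forever.
    with _map_lock:
        event_entry_map.update(result)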
erikjohnston commented 3 years ago

Py-spy:

[py-spy screenshot, 2021-08-25 21:55:37]
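
For anyone running this kind of investigation: py-spy can attach to a live process and dump each thread's Python stack without stopping it. A small sketch that shells out to the py-spy CLI, assuming py-spy is installed and the caller is allowed to ptrace the target:

import subprocess
import sys

def dump_stacks(pid: int) -> None:
    # `py-spy dump` prints the current Python stack of every thread
    # in the process identified by --pid.
    subprocess.run(["py-spy", "dump", "--pid", str(pid)], check=True)

if __name__ == "__main__":
    dump_stacks(int(sys.argv[1]))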

erikjohnston commented 3 years ago

Nothing was logged after it wedged, which suggests it was making zero forward progress.
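
When a worker wedges inside a single call like this, ordinary logging never gets a chance to run. One way to still get a stack out of a live process is CPython's faulthandler, which dumps tracebacks from a C-level signal handler rather than from Python code. A sketch, assuming this is wired up at worker startup and a signal can be sent to the process:

import faulthandler
import signal

# On SIGUSR2, write every thread's Python traceback to stderr.
# faulthandler does this from a C signal handler, so it works even
# when the interpreter is stuck and cannot run Python-level handlers.
faulthandler.register(signal.SIGUSR2, all_threads=True)

With that in place, `kill -USR2 <pid>` against the wedged worker would print stacks similar to what GDB showed above.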

anoadragon453 commented 3 years ago

matrix.org is on commit fe3466a8ade18774d31a2dfe0af1c3ddb52a3dce.

Room was !xPRJVyOhRymgpkarYZ:matrix.org, aka #freenode_#freenode:matrix.org. The operation was part of getting the auth events on a /send_join from archlinux.org (Synapse 1.40.0).

anoadragon453 commented 3 years ago

Federation readers 1 and 2 both experienced the same issue at roughly 10:00 UTC today. The last log lines from each process:

Federation reader 1

2021-08-26 10:00:39,369 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-91522393 - Caught up with stream 'events' to 2261887448
2021-08-26 10:00:39,404 - synapse.access.http.8084 - 410 - INFO - GET-13210389 - 178.254.52.123 - 8084 - {asra.gr} Processed request: 0.010sec/-0.000sec (0.001sec, 0.000sec) (0.001sec/0.002sec/2) 654B 200 "GET /_matrix/federation/v1/user/devices/%40matterbridge1%3Amatrix.org HTTP/1.1" "Synapse/1.41.0" [0 dbevts]
2021-08-26 10:00:39,406 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140529943617816 - Handling 'POSITION events event_persister-1 2261887448 2261887449'
2021-08-26 10:00:39,406 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-91522399 - Caught up with stream 'events' to 2261887449
2021-08-26 10:00:39,408 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140529943617816 - Handling 'POSITION events event_persister-3 2261887448 2261887449'
2021-08-26 10:00:39,408 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-91522400 - Caught up with stream 'events' to 2261887449
2021-08-26 10:00:39,429 - synapse.access.http.8084 - 410 - INFO - GET-13210390 - 2a02:c207:2014:8709::1 - 8084 - {matrix.virtualpudding.org} Processed request: 0.010sec/-0.000sec (0.001sec, 0.000sec) (0.001sec/0.003sec/2) 2723B 200 "GET /_matrix/federation/v1/user/devices/%40Kobberholm%3Amatrix.org HTTP/1.1" "Synapse/1.31.0" [0 dbevts]
2021-08-26 10:00:39,432 - synapse.access.http.8084 - 410 - INFO - GET-13210376 - 2003:a:b4c:7301::4:130 - 8084 - {matrix02.m31.zone} Processed request: 0.507sec/-0.000sec (0.003sec, 0.000sec) (0.000sec/0.001sec/1) 1005B 200 "GET /_matrix/federation/v1/event/$Nr9pSLjHPMLgwboRXBENaubm-cXZcMJswYMMZfIYiE8 HTTP/1.1" "Dendrite/0.4.1+125ea75b" [0 dbevts]
2021-08-26 10:00:39,442 - synapse.access.http.8084 - 410 - INFO - GET-13210391 - 165.227.224.215 - 8084 - {project-nex.net} Processed request: 0.009sec/-0.000sec (0.003sec, 0.000sec) (0.001sec/0.003sec/3) 4812B 200 "GET /_matrix/federation/v1/user/devices/%40dinosaurdynasty%3Amatrix.org HTTP/1.1" "Synapse/1.37.1" [0 dbevts]
2021-08-26 10:00:39,470 - synapse.access.http.8084 - 410 - INFO - GET-13210377 - 2003:a:b4c:7301::4:130 - 8084 - {matrix02.m31.zone} Processed request: 0.521sec/-0.000sec (0.003sec, 0.001sec) (0.001sec/0.009sec/3) 963B 200 "GET /_matrix/federation/v1/event/$bWnGzbOBCt-4dzGLDf7m1DkM419ns3f3v0f3wASTPm8 HTTP/1.1" "Dendrite/0.4.1+125ea75b" [1 dbevts]
2021-08-26 14:52:36,505 - root - 336 - WARNING - main - ***** STARTING SERVER *****

Federation reader 2

2021-08-26 10:01:41,918 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140180607468312 - Handling 'POSITION events event_persister-2 2261889087 2261889088'
2021-08-26 10:01:41,919 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-90625949 - Caught up with stream 'events' to 2261889088
2021-08-26 10:01:41,931 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140180607468312 - Handling 'POSITION events event_persister-2 2261889088 2261889090'
2021-08-26 10:01:41,932 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-90625953 - Caught up with stream 'events' to 2261889090
2021-08-26 10:01:41,946 - synapse.9533_debug - 435 - DEBUG - process-replication-data-90625956 - to-device messages stream id 315805302, awaking streams for ['@nelsond:matrix.org']
2021-08-26 10:01:41,956 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140180607468312 - Handling 'POSITION events event_persister-2 2261889090 2261889091'
2021-08-26 10:01:41,956 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-90625957 - Caught up with stream 'events' to 2261889091
2021-08-26 10:01:42,004 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140180607468312 - Handling 'POSITION events event_persister-1 2261889092 2261889093'
2021-08-26 10:01:42,005 - synapse.replication.tcp.handler - 562 - INFO - process-replication-data-90625965 - Caught up with stream 'events' to 2261889093
2021-08-26 14:52:36,506 - root - 336 - WARNING - main - ***** STARTING SERVER *****

We restarted each worker, which got them going again.
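
Until the root cause is found, one mitigation is a watchdog that spots the signature seen here: sustained ~100% CPU while the worker's log goes silent. A hypothetical sketch using psutil; the log path, thresholds, and helper name are made up for illustration:

import os
import time

import psutil  # third-party: pip install psutil

def looks_wedged(pid: int, logfile: str, quiet_secs: int = 300) -> bool:
    # Sample CPU over 5 seconds; a wedged reader pins a core while
    # writing nothing, so pair high CPU with a stale log mtime.
    busy = psutil.Process(pid).cpu_percent(interval=5.0) > 90.0
    silent = time.time() - os.path.getmtime(logfile) > quiet_secs
    return busy and silent

A cron job or systemd timer could restart the unit when this returns True, matching the manual restart that recovered both readers.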