matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.82k stars 2.13k forks source link

Synapse becomes unresponsive spamming "exceptions.ValueError: list.remove(x): x not in list" errors #3981

Closed Ralith closed 6 years ago

Ralith commented 6 years ago

Description

0.33.5 becomes unresponsive after a few minutes, utterly flooding the logs with this error:

twisted: [] Unhandled Error
Traceback (most recent call last):
  File "/nix/store/4isjzpv4qm10x99inhal77vfd9y9j2h8-matrix-synapse-0.33.5/lib/python2.7/site-packages/synapse/app/_base.py", line 104, in run
    reactor.run()
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/base.py", line 1261, in run
    self.mainLoop()
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/base.py", line 1270, in mainLoop
    self.runUntilCurrent()
  File "/nix/store/4isjzpv4qm10x99inhal77vfd9y9j2h8-matrix-synapse-0.33.5/lib/python2.7/site-packages/synapse/metrics/__init__.py", line 349, in f
    ret = func(*args, **kwargs)
--- <exception caught here> ---
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/base.py", line 896, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/task.py", line 671, in _tick
    taskObj._oneWorkUnit()
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/task.py", line 521, in _oneWorkUnit
    self._completeWith(TaskFailed(), Failure())
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/task.py", line 477, in _completeWith
    self._cooperator._removeTask(self)
  File "/nix/store/w3iklq60jl9wjgb9nv1vlj122fm778wp-python2.7-Twisted-18.7.0/lib/python2.7/site-packages/twisted/internet/task.py", line 644, in _removeTask
    self._tasks.remove(task)
exceptions.ValueError: list.remove(x): x not in list

0.33.3.1 seems to work fine.

Version information

richvdh commented 6 years ago

@Ralith: sorry, I'm not able to reproduce this, or figure out why it might be happening. Are there any ERRORs or CRITICALs in the log before this?

gusttt commented 6 years ago

We run into the same issue:

2018-10-02 16:13:11,768 - twisted - 243 - CRITICAL - - Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/synapse/app/_base.py", line 104, in run
    reactor.run()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1261, in run
    self.mainLoop()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1270, in mainLoop
    self.runUntilCurrent()
  File "/usr/lib/python2.7/dist-packages/synapse/metrics/__init__.py", line 349, in f
    ret = func(*args, **kwargs)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 896, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 671, in _tick
    taskObj._oneWorkUnit()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 521, in _oneWorkUnit
    self._completeWith(TaskFailed(), Failure())
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 477, in _completeWith
    self._cooperator._removeTask(self)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 644, in _removeTask
    self._tasks.remove(task)
exceptions.ValueError: list.remove(x): x not in list

This is occuring a lot:

# grep 'exceptions.ValueError: list.remove(x): x not in list' homeserver.log | wc -l
6351
# grep 'exceptions.ValueError: list.remove(x): x not in list' homeserver.log.1 | wc -l
2082

Version information

richvdh commented 6 years ago

ok, please can you send me more detailed logs.

gusttt commented 6 years ago

I tried to get all the individual CRITICALs and ERRORs leading towards the exceptions.ValueError: list.remove(x): x not in list here: https://www.pastery.net/cnndhr/

richvdh commented 6 years ago

@gusttt : hrm, a bunch of things there, some of which are known, and some of which I have opened issues for:

However, I think the above are all relatively harmless, and do not appear to be directly related to this bug.

So far I have established that this exception is caused if a cooperative task (a) stops itself and (b) raises an exception during its first iteration. I think this might happen if an outbound federation connection drops before we have written the body of a request (so might be somewhat related to #4003) - I need to investigate more. However, that wouldn't really explain why there are thousands of this error.

I think what might help would be complete logs. You can reach me at @richvdh:sw1v.org on matrix if you would prefer not to post the logs publicly.

gusttt commented 6 years ago

Thank you for the feedback. I provided the log file in private.

gusttt commented 6 years ago

We don't have the exceptions.ValueError: list.remove(x): x not in list error any more. Changes:

Perhaps it is the combination of #1760 and something else that leads to this error and unresponsiveness. matrix-synapse looks to be snappy again. Thanks a lot for the assistance @richvdh

richvdh commented 6 years ago

After staring at this for really far too long, I have finally managed to reproduce this, and my conclusions are that the exception being logged is a relatively harmless side-effect of synapse being overloaded and getting behind.

I've raised https://twistedmatrix.com/trac/ticket/9547 to track the noisy exceptions from the twisted side.

Anyway, this really amounts to "Synapse becomes unresponsive", which in all honesty could be due to any number of reasons. I don't think having a single bug to track every reason Synapse could get overloaded is useful, so I'm going to close this. For anyone finding this issue, and @Ralith, please: