Closed Ralith closed 6 years ago
@Ralith: sorry, I'm not able to reproduce this, or figure out why it might be happening. Are there any ERRORs or CRITICALs in the log before this?
We run into the same issue:
2018-10-02 16:13:11,768 - twisted - 243 - CRITICAL - - Unhandled Error
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/synapse/app/_base.py", line 104, in run
reactor.run()
File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1261, in run
self.mainLoop()
File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1270, in mainLoop
self.runUntilCurrent()
File "/usr/lib/python2.7/dist-packages/synapse/metrics/__init__.py", line 349, in f
ret = func(*args, **kwargs)
--- <exception caught here> ---
File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 896, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 671, in _tick
taskObj._oneWorkUnit()
File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 521, in _oneWorkUnit
self._completeWith(TaskFailed(), Failure())
File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 477, in _completeWith
self._cooperator._removeTask(self)
File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 644, in _removeTask
self._tasks.remove(task)
exceptions.ValueError: list.remove(x): x not in list
This is occuring a lot:
# grep 'exceptions.ValueError: list.remove(x): x not in list' homeserver.log | wc -l
6351
# grep 'exceptions.ValueError: list.remove(x): x not in list' homeserver.log.1 | wc -l
2082
ok, please can you send me more detailed logs.
I tried to get all the individual CRITICALs and ERRORs leading towards the exceptions.ValueError: list.remove(x): x not in list
here: https://www.pastery.net/cnndhr/
@gusttt : hrm, a bunch of things there, some of which are known, and some of which I have opened issues for:
However, I think the above are all relatively harmless, and do not appear to be directly related to this bug.
So far I have established that this exception is caused if a cooperative task (a) stops itself and (b) raises an exception during its first iteration. I think this might happen if an outbound federation connection drops before we have written the body of a request (so might be somewhat related to #4003) - I need to investigate more. However, that wouldn't really explain why there are thousands of this error.
I think what might help would be complete logs. You can reach me at @richvdh:sw1v.org
on matrix if you would prefer not to post the logs publicly.
Thank you for the feedback. I provided the log file in private.
We don't have the exceptions.ValueError: list.remove(x): x not in list
error any more.
Changes:
0.33.6
=> the error was no longer spamming, less then 10 occurrences after the update.Perhaps it is the combination of #1760 and something else that leads to this error and unresponsiveness.
matrix-synapse
looks to be snappy again. Thanks a lot for the assistance @richvdh
After staring at this for really far too long, I have finally managed to reproduce this, and my conclusions are that the exception being logged is a relatively harmless side-effect of synapse being overloaded and getting behind.
I've raised https://twistedmatrix.com/trac/ticket/9547 to track the noisy exceptions from the twisted side.
Anyway, this really amounts to "Synapse becomes unresponsive", which in all honesty could be due to any number of reasons. I don't think having a single bug to track every reason Synapse could get overloaded is useful, so I'm going to close this. For anyone finding this issue, and @Ralith, please:
Description
0.33.5 becomes unresponsive after a few minutes, utterly flooding the logs with this error:
0.33.3.1 seems to work fine.
Version information