matrix-org / synapse-s3-storage-provider

Synapse storage provider to fetch and store media in Amazon S3
Apache License 2.0
118 stars 33 forks source link

s3_storage_provider losts connect to bucket #78

Closed DeezyTheFox closed 1 year ago

DeezyTheFox commented 1 year ago

Hello! We have a repeatable but unregular bug with connection to s3. Sometimes media stopped to upload to matrix, clients got timeouts by reserse-proxy. Restart of main synapse process helps for some time. We can't cath this and can't reproduce. Logs spammed by few messages.

Spams everytime, regardless of the problem: urllib3.connectionpool - 305 - WARNING - GET-61775200 - Connection pool is full, discarding connection: [s3 address]. Connection pool size: 10

Cathed sometimes, spammed when there is a problem:

twisted - 274 - CRITICAL - sentinel - Unhandled Error
Traceback (most recent call last):
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/app/_base.py", line 148, in <lambda>
    run_command: Callable[[], None] = lambda: reactor.run(),
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/internet/base.py", line 1315, in run
    self.mainLoop()
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/internet/base.py", line 1328, in mainLoop
    reactorBaseSelf.doIteration(t)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/internet/epollreactor.py", line 244, in doPoll
    log.callWithLogger(selectable, _drdw, selectable, fd, event)
--- <exception caught here> ---
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/python/log.py", line 96, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/python/log.py", line 80, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/python/context.py", line 117, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/python/context.py", line 82, in callWithContext
    return func(*args, **kw)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/internet/posixbase.py", line 696, in _doReadOrWrite
    self._disconnectSelectable(selectable, why, inRead)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/internet/posixbase.py", line 297, in _disconnectSelectable
    selectable.readConnectionLost(f)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/internet/tcp.py", line 309, in readConnectionLost
    self.connectionLost(reason)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/internet/tcp.py", line 320, in connectionLost
    abstract.FileDescriptor.connectionLost(self, reason)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/internet/abstract.py", line 206, in connectionLost
    self.producer.stopProducing()
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/twisted/web/http.py", line 2754, in stopProducing
    self._requestProducer.stopProducing()
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse_s3_storage_provider-1.1-py3.8.egg/s3_storage_provider.py", line 314, in stopProducing
    with LoggingContext():
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/logging/context.py", line 329, in __init__
    raise ValueError(
builtins.ValueError: LoggingContext must be given either a name or a parent context

Cathed only when problem exists:

 twisted - 274 - CRITICAL - sentinel - Unhandled error in Deferred:
 twisted - 274 - CRITICAL - sentinel -
Traceback (most recent call last):
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse_s3_storage_provider-1.1-py3.8.egg/s3_storage_provider.py", line 243, in _stream_to_producer
    raise Exception("Timed out waiting to resume")
Exception: Timed out waiting to resume

Synapse installation:

DMRobertson commented 1 year ago

Tricky to advise, we'd need to investigate and it sounds like it would be involved.

Can you see a pattern between the problem and the rate of incoming media requests? E.g. are you getting lots of small requests? Maybe a few requests that upload huge files? Is there any logging on S3's side which you can correlate with the timeouts?

DMRobertson commented 1 year ago

Can you provide more logs from Synapse? Ideally I'd like to see 5 minutes of logs before and after a request, along with the request URL and the request timestamp.

DeezyTheFox commented 1 year ago

Can you provide more logs from Synapse? Ideally I'd like to see 5 minutes of logs before and after a request, along with the request URL and the request timestamp.

My synapse has been working suspiciously stable for the last week.

It's a bit difficult to collect logs for irregular problem with ~1400 users on homeserver. I'll back with detailed logs when the problem shows up.

H-Shay commented 1 year ago

I am going to close this for now as we cannot action this without logs, if the problem arises again and logs are available we can re-open.