trustlines-protocol / relay

Relay server crashes with LoopExit #580

Closed cducrest closed 3 years ago

cducrest commented 3 years ago

When Sascha started version 0.20.2 of the relay, it crashed with the following traceback:

2021-03-25 09:48:38,193 INFO     [main] trustlines: Starting relay server version 0.20.2
2021-03-25 09:48:38,217 INFO     [main] web3provider: Autodetect provider from uri ws://ws.ethereum.trustlines.tlbc.parity.trustlines-relay-0001
2021-03-25 09:48:38,219 INFO     [main] web3provider: Autodetected WebsocketProvider
2021-03-25 09:48:38,221 WARNING  [main] relay: No account configured
2021-03-25 09:48:38,221 INFO     [main] node: Assuming connected to parity node: Enabling parity-only rpc methods.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/relay/lib/python3.8/site-packages/sentry_sdk/integrations/threading.py", line 69, in run
    reraise(*_capture_exception())
  File "/opt/relay/lib/python3.8/site-packages/sentry_sdk/_compat.py", line 57, in reraise
    raise value
  File "/opt/relay/lib/python3.8/site-packages/sentry_sdk/integrations/threading.py", line 67, in run
    return old_run_func(self, *a, **kw)
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/relay/lib/python3.8/site-packages/web3/providers/websocket.py", line 38, in _start_event_loop
    loop.run_forever()
  File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1823, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/relay/lib/python3.8/site-packages/gevent/selectors.py", line 201, in select
    self._ready.wait(timeout)
  File "src/gevent/event.py", line 163, in gevent._gevent_cevent.Event.wait
  File "src/gevent/_abstract_linkable.py", line 521, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
  File "src/gevent/_abstract_linkable.py", line 487, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 490, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 442, in gevent._gevent_c_abstract_linkable.AbstractLinkable._AbstractLinkable__wait_to_be_notified
  File "src/gevent/_abstract_linkable.py", line 451, in gevent._gevent_c_abstract_linkable.AbstractLinkable._switch_to_hub
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
gevent.exceptions.LoopExit: This operation would block forever
    Hub: <Hub '' at 0x7f4b5e8c0100 epoll pending=0 ref=0 fileno=7 thread_ident=0x7f4b5e85c700>
    Handles:
[]

An important change between 0.20.1 and 0.20.2 is that I upgraded the relay to work with Python 3.8, which meant I had to update gevent from 1.4.0 to 21.1.2. I can see related issues with the same problem: https://github.com/gevent/gevent/issues/1698 and https://github.com/kimbauters/ZIMply/issues/6. However, I am not sure why we call monkey.patch_all(thread=False), i.e. with thread=False, in the boot module: https://github.com/trustlines-protocol/relay/blob/master/src/relay/boot.py#L8

We cannot reproduce the problem locally or on the staging relay server. Ralf suggested that it could be because Sascha runs Sentry and we do not. It appears we do run Sentry on the devel server, so I could try the update there. Ralf also mentioned that web3 uses asyncio, which could be a problem. Another difference is that Sascha uses a websocket URL for the node while we use HTTP.
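For context on the asyncio angle: the traceback shows the websocket provider running loop.run_forever() in its own thread (_start_event_loop in web3/providers/websocket.py). Below is a minimal stdlib-only sketch of that general pattern, without gevent. The names start_threaded_loop, fake_rpc_call and call_from_sync_code are hypothetical and only for illustration; this is not the web3 code itself.

```python
import asyncio
import threading


def start_threaded_loop() -> asyncio.AbstractEventLoop:
    """Start a fresh asyncio event loop in a daemon thread and return it."""
    loop = asyncio.new_event_loop()

    def _run() -> None:
        asyncio.set_event_loop(loop)
        # While idle, the loop blocks inside its selector's select() call.
        loop.run_forever()

    threading.Thread(target=_run, daemon=True).start()
    return loop


async def fake_rpc_call(method: str) -> dict:
    # Hypothetical stand-in for a websocket request/response round trip.
    await asyncio.sleep(0.01)
    return {"method": method, "result": "0x1"}


def call_from_sync_code(
    loop: asyncio.AbstractEventLoop, method: str, timeout: float = 5.0
) -> dict:
    """Submit a coroutine to the loop thread and block until it finishes."""
    future = asyncio.run_coroutine_threadsafe(fake_rpc_call(method), loop)
    return future.result(timeout=timeout)


if __name__ == "__main__":
    loop = start_threaded_loop()
    print(call_from_sync_code(loop, "eth_blockNumber"))
    loop.call_soon_threadsafe(loop.stop)
```

In the crash above, the selector the loop thread blocks in is gevent's (gevent/selectors.py in the traceback), and the hub it switches to has nothing to wait on (pending=0, empty Handles), which is where LoopExit is raised.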

cducrest commented 3 years ago

The problem was identified as coming from the use of a websocket connection for the node RPC in the relay. Web3 uses asyncio to work with websockets, and asyncio does not cooperate well with gevent. I spent time reading about it and experimenting to find a solution, but it appears to be impossible. I believe we should drop support for websockets and revert to HTTP.
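If websocket support is dropped, one option would be to reject websocket URIs early with a clear message instead of crashing later in a background thread. A rough sketch follows, assuming the node URI is available as a plain string at startup. The function require_http_rpc_uri is hypothetical and does not exist in the relay.

```python
from urllib.parse import urlparse

HTTP_SCHEMES = {"http", "https"}
WEBSOCKET_SCHEMES = {"ws", "wss"}


def require_http_rpc_uri(uri: str) -> str:
    """Return the URI unchanged if it is HTTP(S), otherwise raise ValueError.

    Hypothetical startup guard, not part of the relay code base.
    """
    scheme = urlparse(uri).scheme.lower()
    if scheme in WEBSOCKET_SCHEMES:
        raise ValueError(
            f"Websocket node URIs are not supported ({uri!r}); "
            "use an http:// or https:// RPC endpoint instead."
        )
    if scheme not in HTTP_SCHEMES:
        raise ValueError(f"Unsupported node URI scheme {scheme!r} in {uri!r}")
    return uri


if __name__ == "__main__":
    print(require_http_rpc_uri("http://localhost:8545"))
    try:
        require_http_rpc_uri("ws://localhost:8546")
    except ValueError as err:
        print(f"rejected: {err}")
```

The check only inspects the URI scheme and does not contact the node, so it could run before any provider is created.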

I still do not know why this was not a problem in the past and only became one now, but afaict it should not be fixed by trying to make asyncio and gevent work together.