spesmilo / electrumx

Alternative implementation of spesmilo/electrum-server
MIT License
429 stars 343 forks source link

Crashing with `DaemonError: {'code': -32700, 'message': 'Parse error'}`. (RPC_PARSE_ERROR) #238

Open freerko opened 7 months ago

freerko commented 7 months ago

Hi, I am running the newest commit on Ubuntu 22.04 LTS and ElectrumX crashed with this "Fatal error on SSL transport" and "Bad file descriptor" error. Afterwards it was possible to start it again without issues, I hope there are no corrupt files. Maybe issue #22 was similar?

Nov 24 20:30:20 server electrumx_server[64013]: INFO:ElectrumX:[30450] TCP xx.xx.xx.xx:xx, 192 total
Nov 24 20:30:21 server electrumx_server[64013]: INFO:ElectrumX:[30451] SSL xx.xx.xx.xx:xx, 193 total
Nov 24 20:30:21 server electrumx_server[64013]: INFO:ElectrumX:[30452] SSL xx.xx.xx.xx:xx, 193 total
Nov 24 20:30:22 server electrumx_server[64013]: INFO:ElectrumX:[30453] SSL xx.xx.xx.xx:xx, 194 total
Nov 24 20:30:23 server electrumx_server[64013]: INFO:ElectrumX:[30454] SSL xx.xx.xx.xx:xx, 193 total
Nov 24 20:30:24 server electrumx_server[64013]: INFO:ElectrumX:[30455] SSL xx.xx.xx.xx:xx, 194 total
Nov 24 20:30:24 server electrumx_server[64013]: INFO:ElectrumX:[30455] client requested future protocol version 1.5.1 - is your software out of date?
Nov 24 20:30:24 server electrumx_server[64013]: INFO:Prefetcher:cancelled; prefetcher stopping
Nov 24 20:30:24 server electrumx_server[64013]: INFO:BlockProcessor:flushing to DB for a clean shutdown...
Nov 24 20:30:24 server electrumx_server[64013]: INFO:SessionManager:closing down server for wss://all_interfaces:50004
Nov 24 20:30:24 server electrumx_server[64013]: INFO:SessionManager:closing down server for rpc://localhost:8000
Nov 24 20:30:24 server electrumx_server[64013]: INFO:SessionManager:closing down server for tcp://all_interfaces:50001
Nov 24 20:30:24 server electrumx_server[64013]: INFO:SessionManager:closing down server for ssl://all_interfaces:50002
Nov 24 20:30:24 server electrumx_server[64013]: INFO:ElectrumX:[30373] disconnected whilst throttled
Nov 24 20:30:25 server electrumx_server[64013]: INFO:Controller:shutting down
Nov 24 20:30:25 server electrumx_server[64013]: INFO:Controller:shutdown complete
Nov 24 20:30:25 server electrumx_server[64013]: ERROR:electrumx:ElectrumX server terminated abnormally
Nov 24 20:30:25 server electrumx_server[64013]: Traceback (most recent call last):
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/EGG-INFO/scripts/electrumx_server", line 35, in main
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
Nov 24 20:30:25 server electrumx_server[64013]:     return loop.run_until_complete(main)
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
Nov 24 20:30:25 server electrumx_server[64013]:     return future.result()
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/lib/server_base.py", line 129, in run
Nov 24 20:30:25 server electrumx_server[64013]:     await server_task
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/lib/server_base.py", line 102, in serve
Nov 24 20:30:25 server electrumx_server[64013]:     await self.serve(shutdown_event)
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/controller.py", line 134, in serve
Nov 24 20:30:25 server electrumx_server[64013]:     async with OldTaskGroup() as group:
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/aiorpcx/curio.py", line 297, in __aexit__
Nov 24 20:30:25 server electrumx_server[64013]:     await self.join()
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/lib/util.py", line 370, in join
Nov 24 20:30:25 server electrumx_server[64013]:     task.result()
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/mempool.py", line 377, in keep_synchronized
Nov 24 20:30:25 server electrumx_server[64013]:     async with OldTaskGroup() as group:
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/aiorpcx/curio.py", line 297, in __aexit__
Nov 24 20:30:25 server electrumx_server[64013]:     await self.join()
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/lib/util.py", line 370, in join
Nov 24 20:30:25 server electrumx_server[64013]:     task.result()
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/mempool.py", line 259, in _refresh_hashes
Nov 24 20:30:25 server electrumx_server[64013]:     hex_hashes = await self.api.mempool_hashes()
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 248, in mempool_hashes
Nov 24 20:30:25 server electrumx_server[64013]:     return await self._send_single('getrawmempool')
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 191, in _send_single
Nov 24 20:30:25 server electrumx_server[64013]:     return await self._send(payload, processor)
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 150, in _send
Nov 24 20:30:25 server electrumx_server[64013]:     result = processor(result)
Nov 24 20:30:25 server electrumx_server[64013]:   File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 186, in processor
Nov 24 20:30:25 server electrumx_server[64013]:     raise DaemonError(err)
Nov 24 20:30:25 server electrumx_server[64013]: electrumx.server.daemon.DaemonError: {'code': -32700, 'message': 'Parse error'}
Nov 24 20:30:26 server electrumx_server[64013]: Fatal error on SSL transport
Nov 24 20:30:26 server electrumx_server[64013]: protocol: <asyncio.sslproto.SSLProtocol object at 0x7f87d62b2980>
Nov 24 20:30:26 server electrumx_server[64013]: transport: <_SelectorSocketTransport closing fd=194>
Nov 24 20:30:26 server electrumx_server[64013]: Traceback (most recent call last):
Nov 24 20:30:26 server electrumx_server[64013]:   File "/usr/lib/python3.10/asyncio/selector_events.py", line 924, in write
Nov 24 20:30:26 server electrumx_server[64013]:     n = self._sock.send(data)
Nov 24 20:30:26 server electrumx_server[64013]: OSError: [Errno 9] Bad file descriptor
Nov 24 20:30:26 server electrumx_server[64013]: During handling of the above exception, another exception occurred:
Nov 24 20:30:26 server electrumx_server[64013]: Traceback (most recent call last):
Nov 24 20:30:26 server electrumx_server[64013]:   File "/usr/lib/python3.10/asyncio/sslproto.py", line 690, in _process_write_backlog
Nov 24 20:30:26 server electrumx_server[64013]:     self._transport.write(chunk)
Nov 24 20:30:26 server electrumx_server[64013]:   File "/usr/lib/python3.10/asyncio/selector_events.py", line 930, in write
Nov 24 20:30:26 server electrumx_server[64013]:     self._fatal_error(exc, 'Fatal write error on socket transport')
Nov 24 20:30:26 server electrumx_server[64013]:   File "/usr/lib/python3.10/asyncio/selector_events.py", line 725, in _fatal_error
Nov 24 20:30:26 server electrumx_server[64013]:     self._force_close(exc)
Nov 24 20:30:26 server electrumx_server[64013]:   File "/usr/lib/python3.10/asyncio/selector_events.py", line 737, in _force_close
Nov 24 20:30:26 server electrumx_server[64013]:     self._loop.call_soon(self._call_connection_lost, exc)
Nov 24 20:30:26 server electrumx_server[64013]:   File "/usr/lib/python3.10/asyncio/base_events.py", line 753, in call_soon
Nov 24 20:30:26 server electrumx_server[64013]:     self._check_closed()
Nov 24 20:30:26 server electrumx_server[64013]:   File "/usr/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
Nov 24 20:30:26 server electrumx_server[64013]:     raise RuntimeError('Event loop is closed')
Nov 24 20:30:26 server electrumx_server[64013]: RuntimeError: Event loop is closed
Nov 24 20:30:27 server systemd[1]: electrumx.service: Deactivated successfully.
Nov 24 20:30:27 server systemd[1]: electrumx.service: Consumed 1h 16min 43.903s CPU time.
SomberNight commented 7 months ago

It is the first trace that is more interesting.

  File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 248, in mempool_hashes
    return await self._send_single('getrawmempool')
  File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 191, in _send_single
    return await self._send(payload, processor)
  File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 150, in _send
    result = processor(result)
  File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 186, in processor
    raise DaemonError(err)
electrumx.server.daemon.DaemonError: {'code': -32700, 'message': 'Parse error'}

Looks like e-x sent a getrawmempool RPC to bitcoind, and bitcoind responded with an RPC_PARSE_ERROR. This caused e-x to shutdown. (and then while shutting down, the other trace likely happened due to a race)

What version of bitcoind do you use? How are the two processes configured, e.g. is bitcoind on the same machine as electrumx?

SomberNight commented 7 months ago

Can you check the bitcoind logs around the same time the crash happened? Perhaps it logged something more about the parse error.

freerko commented 7 months ago

Its bitcoind 25.1.0 locally on the same machine configured with rpcuser/rpcpassword and rpcworkqueue of 250. Nothing special in the bitcoind logs during that time, I guess:

2023-11-24T20:10:17Z New outbound peer connected: version: 70016, blocks=818307, peer=528964 (block-relay-only) 2023-11-24T20:16:35Z New outbound peer connected: version: 70015, blocks=818307, peer=532813 (block-relay-only) 2023-11-24T20:18:18Z Flushed fee estimates to fee_estimates.dat. 2023-11-24T20:30:11Z Saw new header hash=000000000000000000001ba08686151462f26ba16563503d5ce6822f91c148c4 height=818308 2023-11-24T20:30:12Z UpdateTip: new best=000000000000000000001ba08686151462f26ba16563503d5ce6822f91c148c4 height=818308 version=0x2000c000 log2_work=94.559874 tx=925364346 date='2023- 11-24T20:29:44Z' progress=1.000000 cache=247.4MiB(1705214txo) 2023-11-24T20:33:50Z New outbound peer connected: version: 70016, blocks=818308, peer=543617 (block-relay-only) 2023-11-24T20:34:53Z Saw new header hash=0000000000000000000448495577f0d417a7d73a82001f67c879ac8155adb229 height=818309 2023-11-24T20:34:54Z UpdateTip: new best=0000000000000000000448495577f0d417a7d73a82001f67c879ac8155adb229 height=818309 version=0x2fd36000 log2_work=94.559888 tx=925366081 date='2023- 11-24T20:34:40Z' progress=1.000000 cache=250.3MiB(1728117txo)

Something else seems broken: With "electrumx_rpc getinfo" I get for both daemon height and db height the current latest blockchain height, but when I connect with Electrum to my server it tells me my server is lagging 30 blocks and https://1209k.com/bitcoin-eye/ele.php reports it as BEHIND as well.

SomberNight commented 7 months ago

electrumx.server.daemon.DaemonError: {'code': -32700, 'message': 'Parse error'}

Did this error only happen once? Or does it reproduce somehow? Let me know if it happens again, we could add a special-case for it to wait a bit and retry, and perhaps more importantly, to log more details when it happens.

Something else seems broken: With "electrumx_rpc getinfo" I get for both daemon height and db height the current latest blockchain height, but when I connect with Electrum to my server it tells me my server is lagging 30 blocks

Is this still the case or did the server recover? Is the process at 100% cpu usage?

Its bitcoind 25.1.0 locally on the same machine configured with rpcuser/rpcpassword and rpcworkqueue of 250.

How beefy is the hardware? Is there anything else connected to bitcoind besides electrumx?

smmalis37 commented 7 months ago

I have also just started seeing the height mismatch on my server, cpu usage is fine and I don't see any other errors though.

freerko commented 7 months ago

The parse error happened afterwards again and before as well, I have 5 findings in my syslog (in the old .gz ones it didn't happen):

root@server:# grep Parse /var/log/syslog* /var/log/syslog:Nov 27 16:48:00 server electrumx_server[104490]: INFO:Prefetcher:ignoring daemon error: {'code': -32700, 'message': 'Parse error'} /var/log/syslog.1:Nov 21 14:58:50 server electrumx_server[25828]: INFO:Prefetcher:ignoring daemon error: {'code': -32700, 'message': 'Parse error'} /var/log/syslog.1:Nov 24 04:59:09 server electrumx_server[57701]: electrumx.server.daemon.DaemonError: {'code': -32700, 'message': 'Parse error'} /var/log/syslog.1:Nov 24 19:53:43 server electrumx_server[64013]: INFO:Prefetcher:ignoring daemon error: {'code': -32700, 'message': 'Parse error'} /var/log/syslog.1:Nov 24 20:30:25 server electrumx_server[64013]: electrumx.server.daemon.DaemonError: {'code': -32700, 'message': 'Parse error'} root@server:#

The one on Nov 27 16:48:00 was just that single line, nothing else, no crash. On the other parse errors the server did crash another time (Nov 24 04:59) but on Nov 21 14:58 and Nov 24 19:53 it didn't crash. I've looked into Bitcoins debug.log at the times of all 5 Parse errors but never saw something special.

Nov 21 14:58:49 server electrumx_server[25828]: INFO:ElectrumX:[32041] SSL xx.xx.xx.xx:xx, 204 total Nov 21 14:58:49 server electrumx_server[25828]: INFO:ElectrumX:[32042] SSL xx.xx.xx.xx:xx, 203 total Nov 21 14:58:50 server electrumx_server[25828]: INFO:Prefetcher:ignoring daemon error: {'code': -32700, 'message': 'Parse error'} Nov 21 14:58:50 server electrumx_server[25828]: ERROR:Daemon:daemon service refused: Nov 21 14:58:50 server electrumx_server[25828]: 400 Bad Request Nov 21 14:58:50 server electrumx_server[25828]: </HEAD/> Nov 21 14:58:50 server electrumx_server[25828]: <H 1>Bad Request</H 1> Nov 21 14:58:50 server electrumx_server[25828]: </BO DY></H TML>. Retrying occasionally... Nov 21 14:58:50 server electrumx_server[25828]: INFO:Daemon:running normally Nov 21 14:58:50 server electrumx_server[25828]: INFO:ElectrumX:[32043] SSL xx.xx.xx.xx:xx, 204 total Nov 21 14:58:51 server electrumx_server[25828]: INFO:ElectrumX:[32044] SSL xx.xx.xx.xx:xx, 205 total Nov 21 14:58:51 server electrumx_server[25828]: INFO:ElectrumX:[32045] SSL xx.xx.xx.xx:xx, 206 total

On Nov 21 I still had Bitcoin version 0.24 running. Next one:

Nov 24 04:59:06 server electrumx_server[57701]: INFO:ElectrumX:[37162] SSL xx.xx.xx.xx:xx, 178 total Nov 24 04:59:07 server electrumx_server[57701]: INFO:ElectrumX:[37163] TCP xx.xx.xx.xx:xx, 174 total Nov 24 04:59:07 server electrumx_server[57701]: INFO:ElectrumX:[37164] SSL xx.xx.xx.xx:xx, 175 total Nov 24 04:59:08 server electrumx_server[57701]: INFO:ElectrumX:[37165] SSL xx.xx.xx.xx:xx, 176 total Nov 24 04:59:08 server electrumx_server[57701]: INFO:Prefetcher:cancelled; prefetcher stopping Nov 24 04:59:08 server electrumx_server[57701]: INFO:BlockProcessor:flushing to DB for a clean shutdown... Nov 24 04:59:08 server electrumx_server[57701]: INFO:SessionManager:closing down server for ssl://all_interfaces:50002 Nov 24 04:59:08 server electrumx_server[57701]: INFO:SessionManager:closing down server for tcp://all_interfaces:50001 Nov 24 04:59:08 server electrumx_server[57701]: INFO:SessionManager:closing down server for rpc://localhost:8000 Nov 24 04:59:08 server electrumx_server[57701]: INFO:SessionManager:closing down server for wss://all_interfaces:50004 Nov 24 04:59:08 server electrumx_server[57701]: INFO:ElectrumX:[37166] SSL xx.xx.xx.xx:xx, 83 total Nov 24 04:59:08 server electrumx_server[57701]: INFO:ElectrumX:[37167] SSL xx.xx.xx.xx:xx, 18 total Nov 24 04:59:09 server electrumx_server[57701]: INFO:ElectrumX:[36267] disconnected whilst throttled Nov 24 04:59:09 server electrumx_server[57701]: INFO:ElectrumX:[36281] disconnected whilst throttled Nov 24 04:59:09 server electrumx_server[57701]: INFO:ElectrumX:[36355] disconnected whilst throttled Nov 24 04:59:09 server electrumx_server[57701]: INFO:Controller:shutting down Nov 24 04:59:09 server electrumx_server[57701]: INFO:Controller:shutdown complete Nov 24 04:59:09 server electrumx_server[57701]: ERROR:electrumx:ElectrumX server terminated abnormally Nov 24 04:59:09 server electrumx_server[57701]: Traceback (most recent call last): Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/EGG-INFO/scripts/electrumx_server", line 35, in main Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run Nov 24 04:59:09 server electrumx_server[57701]: return loop.run_until_complete(main) Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete Nov 24 04:59:09 server electrumx_server[57701]: return future.result() Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/lib/server_base.py", line 129, in run Nov 24 04:59:09 server electrumx_server[57701]: await server_task Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/lib/server_base.py", line 102, in serve Nov 24 04:59:09 server electrumx_server[57701]: await self.serve(shutdown_event) Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/controller.py", line 134, in serve Nov 24 04:59:09 server electrumx_server[57701]: async with OldTaskGroup() as group: Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/aiorpcx/curio.py", line 297, in aexit Nov 24 04:59:09 server electrumx_server[57701]: await self.join() Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/lib/util.py", line 370, in join Nov 24 04:59:09 server electrumx_server[57701]: task.result() Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/mempool.py", line 377, in keep_synchronized Nov 24 04:59:09 server electrumx_server[57701]: async with OldTaskGroup() as group: Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/aiorpcx/curio.py", line 297, in aexit Nov 24 04:59:09 server electrumx_server[57701]: await self.join() Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/lib/util.py", line 370, in join Nov 24 04:59:09 server electrumx_server[57701]: task.result() Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/mempool.py", line 259, in _refresh_hashes Nov 24 04:59:09 server electrumx_server[57701]: hex_hashes = await self.api.mempool_hashes() Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 248, in mempool_hashes Nov 24 04:59:09 server electrumx_server[57701]: return await self._send_single('getrawmempool') Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 191, in _send_single Nov 24 04:59:09 server electrumx_server[57701]: return await self._send(payload, processor) Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 150, in _send Nov 24 04:59:09 server electrumx_server[57701]: result = processor(result) Nov 24 04:59:09 server electrumx_server[57701]: File "/usr/local/lib/python3.10/dist-packages/e_x-1.16.0-py3.10.egg/electrumx/server/daemon.py", line 186, in processor Nov 24 04:59:09 server electrumx_server[57701]: raise DaemonError(err) Nov 24 04:59:09 server electrumx_server[57701]: electrumx.server.daemon.DaemonError: {'code': -32700, 'message': 'Parse error'} Nov 24 04:59:11 server systemd[1]: electrumx.service: Deactivated successfully. Nov 24 04:59:11 server systemd[1]: electrumx.service: Consumed 1h 14min 12.093s CPU time. Nov 24 15:47:11 server systemd[1]: Started Electrumx. Nov 24 15:47:12 server electrumx_server[64013]: INFO:electrumx:ElectrumX server starting

On Nov 24 I already had Bitcoin version 0.25. Next one:

Nov 24 19:53:35 server electrumx_server[64013]: INFO:ElectrumX:[26623] SSL xx.xx.xx.xx:xx, 197 total Nov 24 19:53:36 server electrumx_server[64013]: INFO:ElectrumX:[26624] SSL xx.xx.xx.xx:xx, 198 total Nov 24 19:53:36 server electrumx_server[64013]: INFO:ElectrumX:[26625] SSL xx.xx.xx.xx:xx, 199 total Nov 24 19:53:39 server electrumx_server[64013]: INFO:ElectrumX:[26626] SSL xx.xx.xx.xx:xx, 198 total Nov 24 19:53:41 server electrumx_server[64013]: INFO:ElectrumX:[26627] SSL xx.xx.xx.xx:xx, 198 total Nov 24 19:53:41 server electrumx_server[64013]: INFO:ElectrumX:[26628] SSL xx.xx.xx.xx:xx, 199 total Nov 24 19:53:42 server electrumx_server[64013]: INFO:ElectrumX:[6031] changing task concurrency from 10 to 0 Nov 24 19:53:42 server electrumx_server[64013]: INFO:ElectrumX:[13921] changing task concurrency from 10 to 0 Nov 24 19:53:42 server electrumx_server[64013]: INFO:ElectrumX:[15866] changing task concurrency from 10 to 0 Nov 24 19:53:42 server electrumx_server[64013]: INFO:ElectrumX:[16567] changing task concurrency from 10 to 0 Nov 24 19:53:42 server electrumx_server[64013]: INFO:ElectrumX:[16745] changing task concurrency from 10 to 0 Nov 24 19:53:42 server electrumx_server[64013]: INFO:ElectrumX:[18720] changing task concurrency from 10 to 0 Nov 24 19:53:42 server electrumx_server[64013]: INFO:ElectrumX:[21444] changing task concurrency from 10 to 0 Nov 24 19:53:42 server electrumx_server[64013]: INFO:ElectrumX:[25922] changing task concurrency from 10 to 0 Nov 24 19:53:42 server electrumx_server[64013]: INFO:ElectrumX:[26350] changing task concurrency from 10 to 0 Nov 24 19:53:42 server electrumx_server[64013]: INFO:ElectrumX:[26629] TCP xx.xx.xx.xx:xx, 198 total Nov 24 19:53:42 server electrumx_server[64013]: INFO:ElectrumX:[26630] SSL xx.xx.xx.xx:xx, 199 total Nov 24 19:53:43 server electrumx_server[64013]: INFO:SessionManager:closing stale session ids 22163, 25412 Nov 24 19:53:43 server electrumx_server[64013]: INFO:ElectrumX:[26631] SSL xx.xx.xx.xx:xx, 197 total Nov 24 19:53:43 server electrumx_server[64013]: INFO:ElectrumX:[26632] SSL xx.xx.xx.xx:xx, 198 total Nov 24 19:53:43 server electrumx_server[64013]: INFO:Prefetcher:ignoring daemon error: {'code': -32700, 'message': 'Parse error'} Nov 24 19:53:43 server electrumx_server[64013]: ERROR:Daemon:daemon service refused: Nov 24 19:53:43 server electrumx_server[64013]: 400 Bad Request</TIT LE> Nov 24 19:53:43 server electrumx_server[64013]: </HE AD> Nov 24 19:53:43 server electrumx_server[64013]: <H 1>Bad Request</H 1> Nov 24 19:53:43 server electrumx_server[64013]: </BO DY></HT ML>. Retrying occasionally... Nov 24 19:53:43 server electrumx_server[64013]: INFO:ElectrumX:[26633] SSL xx.xx.xx.xx:xx, 199 total Nov 24 19:53:43 server electrumx_server[64013]: INFO:ElectrumX:[26634] SSL xx.xx.xx.xx:xx, 200 total Nov 24 19:53:43 server electrumx_server[64013]: INFO:Daemon:running normally Nov 24 19:53:44 server electrumx_server[64013]: INFO:ElectrumX:[25922] closing session over res usage. ip: 185.220.101.xxx. groups: ['185.220.101.0/24'] Nov 24 19:53:44 server electrumx_server[64013]: INFO:MemPool:91,011 txs 72.94 MB touching 389,616 addresses Nov 24 19:53:44 server electrumx_server[64013]: INFO:ElectrumX:[25922] disconnected whilst throttled Nov 24 19:53:45 server electrumx_server[64013]: INFO:ElectrumX:[26635] SSL xx.xx.xx.xx:xx, 193 total Nov 24 19:53:45 server electrumx_server[64013]: INFO:ElectrumX:[26636] SSL xx.xx.xx.xx:xx, 194 total Nov 24 19:53:46 server electrumx_server[64013]: INFO:ElectrumX:[26637] SSL xx.xx.xx.xx:xx, 193 total Nov 24 19:53:46 server electrumx_server[64013]: INFO:ElectrumX:[26638] SSL xx.xx.xx.xx:xx, 194 total Nov 24 19:53:46 server electrumx_server[64013]: INFO:ElectrumX:[26639] SSL xx.xx.xx.xx:xx, 194 total

The other issue: Electrumx apparently didn't parse any new blocks at all and after a day or so I restarted it and now they are up to date again. During the lag I didn't notice any high CPU usage nor RAM or I/O. Its a VM with 30GB RAM, 8 Cores AMD EPYC 7282 and 800GB SSD and only bitcoind, tor and electrumx are running. It transfers about 700-800GB (total, rx+tx) per day. Tor consumes about 50% of one core, bitcoind about 50% of one core and electrum about 15-30% of one core. I never see IO wait.

Edit: Electrumx stopped processing blocks again without anything interesting in the logfile during that time. Block processing usually takes between 1s and 11s on this system and there is no high CPU nor IO load. In the electrumx log I see about 10 "timed out after 30 secs" messages for transaction.get or transaction.broadcast per minute - its the about the same amount in the beginning when still processing blocks and afterwards without processing blocks. Currently "electrumx_rpc getinfo" gives me:

"daemon height": 818745,
"db height": 818735,

Where 818735 is the height reported when connecting via Electrum as well. At the same time "bitcoind getblockcount" on that system reports the correct current block 818756. With "bitcoin-cli getrpcinfo" I see only the active getrpcinfo call, nothing else.

SomberNight commented 7 months ago

The parse error happened afterwards again and before as well, I have 5 findings in my syslog (in the old .gz ones it didn't happen)

Ok, so it was not a one-off for you then. In that case, I've pushed a small change, please try with that: https://github.com/spesmilo/electrumx/commit/314893655a7cc0bfcc216b07900fa77b5f66e148

The other issue: Electrumx apparently didn't parse any new blocks at all and after a day or so I restarted it and now they are up to date again. During the lag I didn't notice any high CPU usage nor RAM or I/O. [...] Edit: Electrumx stopped processing blocks again without anything interesting in the logfile during that time.

Perhaps this is not related at all then.

Still I am interested in what will get logged re RPC_PARSE_ERROR with the new commit.

freerko commented 7 months ago

Thanks! I am now running the current main branch.

biatwc commented 6 months ago

I have seen similar on an LTC electrumx, I tried to get the full aiohttp error with rich.console which I've attached. In my case the LTC RPC node is on a separate VM and a wireguard tunnel linking the electrumx (there are 4 electrumx all using the same LTC RPC node - I've increased rpcworkqueue and rpcthreads from the defaults). What tends to happen is the prefetcher stops after a parse error and so electrumx falls behind tip and loses peers.

I have seen the prefetcher stopping without a parse error, it will just stop and I don't see any error around that happening to suggest why. Is there something that could be done to make the prefetcher more resilient when the RPC node isn't located on the same VM or LAN segment?

The attachment should be read bottom to top. err_out.txt


edit: attachment inlined and reversed:

log ``` Dec 08 11:20: INFO:Prefetcher:ignoring daemon error: {'code': -32700, 'message': 'Parse error'} Dec 08 11:20: ERROR:Daemon:daemon service refused: Dec 08 11:20: 400 Bad Request Dec 08 11:20: Dec 08 11:20:

Bad Request

Dec 08 11:20: . Retrying occasionally... Dec 08 11:20: ╭───────────────────── Traceback (most recent call last) ──────────────────────╮ Dec 08 11:20: │ /opt/electrumx/env/lib64/python3.9/site-packages/aiohttp/client_reqrep.p │ Dec 08 11:20: │ y:609 in write_bytes │ Dec 08 11:20: │ │ Dec 08 11:20: │ 606 │ │ │ Dec 08 11:20: │ 607 │ │ return True │ Dec 08 11:20: │ 608 │ │ Dec 08 11:20: │ ❱ 609 │ async def write_bytes( │ Dec 08 11:20: │ 610 │ │ self, writer: AbstractStreamWriter, conn: "Connection" │ Dec 08 11:20: │ 611 │ ) -> None: │ Dec 08 11:20: │ 612 │ │ """Support coroutines that yields bytes objects.""" │ Dec 08 11:20: │ │ Dec 08 11:20: │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ Dec 08 11:20: │ │ conn = Connection │ │ Dec 08 11:20: │ │ self = │ │ Dec 08 11:20: │ │ writer = │ │ Dec 08 11:20: │ ╰──────────────────────────────────────────────────────────────────────────╯ │ Dec 08 11:20: ╰──────────────────────────────────────────────────────────────────────────────╯ Dec 08 11:20: CancelledError Dec 08 11:20: During handling of the above exception, another exception occurred: Dec 08 11:20: ╭───────────────────── Traceback (most recent call last) ──────────────────────╮ Dec 08 11:20: │ /opt/electrumx/env/lib/python3.9/site-packages/electrumx/server/block_pr │ Dec 08 11:20: │ ocessor.py:71 in main_loop │ Dec 08 11:20: │ │ Dec 08 11:20: │ 68 │ │ │ try: │ Dec 08 11:20: │ 69 │ │ │ │ # Sleep a while if there is nothing to prefetch │ Dec 08 11:20: │ 70 │ │ │ │ await self.refill_event.wait() │ Dec 08 11:20: │ ❱ 71 │ │ │ │ if not await self._prefetch_blocks(): │ Dec 08 11:20: │ 72 │ │ │ │ │ await asyncio.sleep(self.polling_delay) │ Dec 08 11:20: │ 73 │ │ │ except DaemonError as e: │ Dec 08 11:20: │ 74 │ │ │ │ self.logger.info(f'ignoring daemon error: {e}') │ Dec 08 11:20: │ │ Dec 08 11:20: │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ Dec 08 11:20: │ │ bp_height = 2589692 │ │ Dec 08 11:20: │ │ console = │ │ Dec 08 11:20: │ │ e = CancelledError() │ │ Dec 08 11:20: │ │ self = │ │ Dec 08 11:20: │ ╰──────────────────────────────────────────────────────────────────────────╯ │ Dec 08 11:20: │ │ Dec 08 11:20: │ /opt/electrumx/env/lib/python3.9/site-packages/electrumx/server/block_pr │ Dec 08 11:20: │ ocessor.py:120 in _prefetch_blocks │ Dec 08 11:20: │ │ Dec 08 11:20: │ 117 │ │ Repeats until the queue is full or caught up. │ Dec 08 11:20: │ 118 │ │ ''' │ Dec 08 11:20: │ 119 │ │ daemon = self.daemon │ Dec 08 11:20: │ ❱ 120 │ │ daemon_height = await daemon.height() │ Dec 08 11:20: │ 121 │ │ async with self.semaphore: │ Dec 08 11:20: │ 122 │ │ │ while self.cache_size < self.min_cache_size: │ Dec 08 11:20: │ 123 │ │ │ │ first = self.fetched_height + 1 │ Dec 08 11:20: │ │ Dec 08 11:20: │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ Dec 08 11:20: │ │ daemon = │ │ Dec 08 11:20: │ │ self = │ │ Dec 08 11:20: │ ╰──────────────────────────────────────────────────────────────────────────╯ │ Dec 08 11:20: │ │ Dec 08 11:20: │ /opt/electrumx/env/lib/python3.9/site-packages/electrumx/server/daemon.p │ Dec 08 11:20: │ y:302 in height │ Dec 08 11:20: │ │ Dec 08 11:20: │ 299 │ │ Dec 08 11:20: │ 300 │ async def height(self): │ Dec 08 11:20: │ 301 │ │ '''Query the daemon for its current height.''' │ Dec 08 11:20: │ ❱ 302 │ │ self._height = await self._send_single('getblockcount') │ Dec 08 11:20: │ 303 │ │ return self._height │ Dec 08 11:20: │ 304 │ │ Dec 08 11:20: │ 305 │ def cached_height(self): │ Dec 08 11:20: │ │ Dec 08 11:20: │ ╭───────────────────────────── locals ─────────────────────────────╮ │ Dec 08 11:20: │ │ self = │ │ Dec 08 11:20: │ ╰──────────────────────────────────────────────────────────────────╯ │ Dec 08 11:20: │ │ Dec 08 11:20: │ /opt/electrumx/env/lib/python3.9/site-packages/electrumx/server/daemon.p │ Dec 08 11:20: │ y:191 in _send_single │ Dec 08 11:20: │ │ Dec 08 11:20: │ 188 │ │ payload = {'method': method, 'id': next(self.id_counter)} │ Dec 08 11:20: │ 189 │ │ if params: │ Dec 08 11:20: │ 190 │ │ │ payload['params'] = params │ Dec 08 11:20: │ ❱ 191 │ │ return await self._send(payload, processor) │ Dec 08 11:20: │ 192 │ │ Dec 08 11:20: │ 193 │ async def _send_vector(self, method, params_iterable, replace_errs │ Dec 08 11:20: │ 194 │ │ '''Send several requests of the same method. │ Dec 08 11:20: │ │ Dec 08 11:20: │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ Dec 08 11:20: │ │ method = 'getblockcount' │ │ Dec 08 11:20: │ │ params = None │ │ Dec 08 11:20: │ │ payload = {'method': 'getblockcount', 'id': 75371} │ │ Dec 08 11:20: │ │ processor = .processor at │ │ Dec 08 11:20: │ │ 0x7f2842a668b0> │ │ Dec 08 11:20: │ │ self = │ │ Dec 08 11:20: │ ╰──────────────────────────────────────────────────────────────────────────╯ │ Dec 08 11:20: │ │ Dec 08 11:20: │ /opt/electrumx/env/lib/python3.9/site-packages/electrumx/server/daemon.p │ Dec 08 11:20: │ y:149 in _send │ Dec 08 11:20: │ │ Dec 08 11:20: │ 146 │ │ retry = self.init_retry │ Dec 08 11:20: │ 147 │ │ while True: │ Dec 08 11:20: │ 148 │ │ │ try: │ Dec 08 11:20: │ ❱ 149 │ │ │ │ result = await self._send_data(data) │ Dec 08 11:20: │ 150 │ │ │ │ result = processor(result) │ Dec 08 11:20: │ 151 │ │ │ │ if on_good_message: │ Dec 08 11:20: │ 152 │ │ │ │ │ self.logger.info(on_good_message) │ Dec 08 11:20: │ │ Dec 08 11:20: │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ Dec 08 11:20: │ │ data = '{"method": "getblockcount", "id": 75371}' │ │ Dec 08 11:20: │ │ last_error_log = 1776533.851677362 │ │ Dec 08 11:20: │ │ log_error = .log_error at │ │ Dec 08 11:20: │ │ 0x7f2842a66550> │ │ Dec 08 11:20: │ │ on_good_message = 'running normally' │ │ Dec 08 11:20: │ │ payload = {'method': 'getblockcount', 'id': 75371} │ │ Dec 08 11:20: │ │ processor = .processor at │ │ Dec 08 11:20: │ │ 0x7f2842a668b0> │ │ Dec 08 11:20: │ │ retry = 0.5 │ │ Dec 08 11:20: │ │ self = │ │ Dec 08 11:20: │ ╰──────────────────────────────────────────────────────────────────────────╯ │ Dec 08 11:20: │ │ Dec 08 11:20: │ /opt/electrumx/env/lib/python3.9/site-packages/electrumx/server/daemon.p │ Dec 08 11:20: │ y:123 in _send_data │ Dec 08 11:20: │ │ Dec 08 11:20: │ 120 │ │ │ async with self.session.post(self.current_url(), data=data │ Dec 08 11:20: │ 121 │ │ │ │ kind = resp.headers.get('Content-Type', None) │ Dec 08 11:20: │ 122 │ │ │ │ if kind == 'application/json': │ Dec 08 11:20: │ ❱ 123 │ │ │ │ │ return await resp.json(loads=json_deserialize) │ Dec 08 11:20: │ 124 │ │ │ │ text = await resp.text() │ Dec 08 11:20: │ 125 │ │ │ │ text = text.strip() or resp.reason │ Dec 08 11:20: │ 126 │ │ │ │ raise ServiceRefusedError(text) │ Dec 08 11:20: │ │ Dec 08 11:20: │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ Dec 08 11:20: │ │ data = '{"method": "getblockcount", "id": 75371}' │ │ Dec 08 11:20: │ │ kind = 'application/json' │ │ Dec 08 11:20: │ │ resp = │ │ Dec 08 11:20: │ │ │ │ Dec 08 11:20: │ │ self = │ │ Dec 08 11:20: │ ╰──────────────────────────────────────────────────────────────────────────╯ │ Dec 08 11:20: │ │ Dec 08 11:20: │ /opt/electrumx/env/lib64/python3.9/site-packages/aiohttp/client_reqrep.p │ Dec 08 11:20: │ y:1161 in json │ Dec 08 11:20: │ │ Dec 08 11:20: │ 1158 │ ) -> Any: │ Dec 08 11:20: │ 1159 │ │ """Read and decodes JSON response.""" │ Dec 08 11:20: │ 1160 │ │ if self._body is None: │ Dec 08 11:20: │ ❱ 1161 │ │ │ await self.read() │ Dec 08 11:20: │ 1162 │ │ │ Dec 08 11:20: │ 1163 │ │ if content_type: │ Dec 08 11:20: │ 1164 │ │ │ ctype = self.headers.get(hdrs.CONTENT_TYPE, "").lower() │ Dec 08 11:20: │ │ Dec 08 11:20: │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ Dec 08 11:20: │ │ content_type = 'application/json' │ │ Dec 08 11:20: │ │ encoding = None │ │ Dec 08 11:20: │ │ loads = │ │ Dec 08 11:20: │ │ self = │ │ Dec 08 11:20: │ │ │ │ Dec 08 11:20: │ ╰──────────────────────────────────────────────────────────────────────────╯ │ Dec 08 11:20: │ │ Dec 08 11:20: │ /opt/electrumx/env/lib64/python3.9/site-packages/aiohttp/client_reqrep.p │ Dec 08 11:20: │ y:1114 in read │ Dec 08 11:20: │ │ Dec 08 11:20: │ 1111 │ │ elif self._released: # Response explicitly released │ Dec 08 11:20: │ 1112 │ │ │ raise ClientConnectionError("Connection closed") │ Dec 08 11:20: │ 1113 │ │ │ Dec 08 11:20: │ ❱ 1114 │ │ await self._wait_released() # Underlying connection released │ Dec 08 11:20: │ 1115 │ │ return self._body # type: ignore[no-any-return] │ Dec 08 11:20: │ 1116 │ │ Dec 08 11:20: │ 1117 │ def get_encoding(self) -> str: │ Dec 08 11:20: │ │ Dec 08 11:20: │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ Dec 08 11:20: │ │ self = │ │ Dec 08 11:20: │ │ │ │ Dec 08 11:20: │ ╰──────────────────────────────────────────────────────────────────────────╯ │ Dec 08 11:20: │ │ Dec 08 11:20: │ /opt/electrumx/env/lib64/python3.9/site-packages/aiohttp/client_reqrep.p │ Dec 08 11:20: │ y:1080 in _wait_released │ Dec 08 11:20: │ │ Dec 08 11:20: │ 1077 │ │ Dec 08 11:20: │ 1078 │ async def _wait_released(self) -> None: │ Dec 08 11:20: │ 1079 │ │ if self._writer is not None: │ Dec 08 11:20: │ ❱ 1080 │ │ │ await self._writer │ Dec 08 11:20: │ 1081 │ │ self._release_connection() │ Dec 08 11:20: │ 1082 │ │ Dec 08 11:20: │ 1083 │ def _cleanup_writer(self) -> None: │ Dec 08 11:20: │ │ Dec 08 11:20: │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ Dec 08 11:20: │ │ self = │ │ Dec 08 11:20: │ │ │ │ Dec 08 11:20: │ ╰──────────────────────────────────────────────────────────────────────────╯ │ Dec 08 11:20: ╰──────────────────────────────────────────────────────────────────────────────╯ Dec 08 11:20: CancelledError Dec 08 11:20: INFO:Prefetcher:cancelled; prefetcher stopping ```
SomberNight commented 6 months ago

@biatwc in your log, there is:

  1. INFO:Prefetcher:ignoring daemon error: {'code': -32700, 'message': 'Parse error'}
  2. ERROR:Daemon:daemon service refused
  3. and then there is a CancelledError

The first two are handled properly I think, they are just logged and then we retry. I am not sure where the cancellation originates from -- that is what ultimately kills the prefetcher. Some more log lines before/after might help.

biatwc commented 6 months ago

I've looked at the logs around the time of the CancelledError on a number of occasions and there is nothing other than the usual INFO level INFO:ElectrumX:[26054] TCP [::1]:46343, 676 total. No errors, nothing out of the ordinary. The various tunnels I've tried over time remain up and traffic flows fine. What does CancelledError actually mean in this context - the RPC daemon tears down all tcp sessions from the electrumx node? Can electrumx catch CancalledError and retry, or mark the daemon as temporarily unavailable and retry occasionally? Would that help flush out what is going on? Sorry I can't be more help.

SomberNight commented 6 months ago

How often does this bug happen for you? Is it only the prefetcher stopping, is the process otherwise still running and serving sessions? I guess you don't have a reliable way to reproduce, do you? :P

What does CancelledError actually mean in this context - the RPC daemon tears down all tcp sessions from the electrumx node? Can electrumx catch CancalledError and retry, or mark the daemon as temporarily unavailable and retry occasionally? Would that help flush out what is going on?

It can be a lot of things. For example, if you intentionally shut down electrumx in a clean way, that too will happen via cancellations (explicit calls to task.cancel and multiple CancelledErrors propagating out and caught at some locations), so we can't just catch CancelledError and retry :/