Closed laanwj closed 3 years ago
That is indeed correct, db_write
is a synchronous hook, i.e., we don't process anything else until the hook returns. This also includes RPC calls such as listconfigs
. I'm surprised however that we get stuck on the init
call, which is used to tell us that the RPC interface is up and where to find it.
That is unless the listconfigs
call creates a new DB transaction, which happens to trigger some changes, which then get reported back to db_write
, however the plugin is busy waiting for the listconfigs
reply and the db_write
hook hangs indefinitely. Would be interesting to see if just moving the listconfigs
call to a new thread is sufficient.
Would be interesting to see if just moving the listconfigs call to a new thread is sufficient.
It would be possible to defer the check, but as I understand the nature of the check seems to be to want to fail quickly in the case of a wrong database kind.
I agree it's really strange. I thought for a minute that on_init
was called from first invocation of the db_sync
hook, but it seems it is already a separate hook.
Not entirely sure what is happening here, to be honest. Could it be a race between two calls to the plugin itself? It seems to fail too reliably for that though.
Bet that's why the CI often failed/timeout on the backup plugin lately. ( @gallizoltan ) Someone already some clues? I will try to reproduce / isolate this.
To reproduce one can run test_compact
frequently by i.e. py.test --count 100 -x -n 16 --timeout 60 -v backup/test_backup.py -k test_compact
there are deadlocking instances, see:
https://gist.github.com/m-schmoock/8ff49188399547c2d6eb5a188a68d6f9
The testcase itself unsurprisingly locks up when trying to get the nodes from the factory:
l1 = node_factory.get_node(options=opts, cleandir=False)
As @laanwj pointed out this indeed 'fixed' when removing the listconfigs sqlite3 check: https://github.com/ElementsProject/lightning/issues/4348#issuecomment-769449638
This is what gdb tells us when I attach to a locked up lightningd
instance:
(gdb) bt
#0 0x00007f77d0854437 in poll () from /usr/lib/libc.so.6
#1 0x000055a2795a7ee9 in daemon_poll (fds=0x55a27a36ad48, nfds=28, timeout=-1) at common/daemon.c:94
#2 0x000055a279554794 in io_poll_lightningd (fds=0x55a27a36ad48, nfds=28, timeout=-1) at lightningd/lightningd.c:672
#3 0x000055a2795ff3c9 in io_loop (timers=0x0, expired=0x0) at ccan/ccan/io/poll.c:412
#4 0x000055a27957d1af in plugins_exclusive_loop (plugins=0x55a27a443518) at lightningd/plugin.c:1857
#5 0x000055a27957ed24 in plugin_hook_db_sync (db=0x55a27a35ce38) at lightningd/plugin_hook.c:397
#6 0x000055a279586118 in db_report_changes (db=0x55a27a35ce38, final=0x0, min=0) at wallet/db.c:857
#7 0x000055a279586350 in db_commit_transaction (db=0x55a27a35ce38) at wallet/db.c:927
#8 0x000055a2795392a1 in getrawblockbyheight_callback (
buf=0x55a27a36efb8 "\n\n{\"jsonrpc\":\"2.0\",\"id\":26,\"result\":{\"blockhash\":\"006c743d6b4d2286957acbba98a208ddd36ab2b757a68e38168e6e4f4fbfc104\",\"block\":\"00000020a7b2ffdc3d430f6ac9c7ca4fa786bd6a5ce5b1e7da61f5d48803fdf6f31b7114ad8"..., toks=0x55a27a3bdc38, idtok=0x55a27a3bdc88, call=0x55a27a38f138) at lightningd/bitcoind.c:445
#9 0x000055a279579044 in plugin_response_handle (plugin=0x55a27a348968, toks=0x55a27a3bdc38, idtok=0x55a27a3bdc88) at lightningd/plugin.c:432
#10 0x000055a279579250 in plugin_read_json_one (plugin=0x55a27a348968, complete=0x7fff4b5aef65, destroyed=0x7fff4b5aef66) at lightningd/plugin.c:538
#11 0x000055a279579416 in plugin_read_json (conn=0x55a27a35c2c8, plugin=0x55a27a348968) at lightningd/plugin.c:583
#12 0x000055a2795fc767 in next_plan (conn=0x55a27a35c2c8, plan=0x55a27a35c2e8) at ccan/ccan/io/io.c:59
#13 0x000055a2795fd2ee in do_plan (conn=0x55a27a35c2c8, plan=0x55a27a35c2e8, idle_on_epipe=false) at ccan/ccan/io/io.c:407
#14 0x000055a2795fd32c in io_ready (conn=0x55a27a35c2c8, pollflags=1) at ccan/ccan/io/io.c:417
#15 0x000055a2795ff4eb in io_loop (timers=0x55a27a30a0a8, expired=0x7fff4b5af0a0) at ccan/ccan/io/poll.c:445
#16 0x000055a27954f64c in io_loop_with_timers (ld=0x55a27a3065b8) at lightningd/io_loop_with_timers.c:24
#17 0x000055a279554fad in main (argc=21, argv=0x7fff4b5af248) at lightningd/lightningd.c:1016
'fixed' by #212 - Close for now
The call to
listconfigs
inside thedb_sync
hook can hang indefinitely in some circumstances. See https://github.com/ElementsProject/lightning/issues/4348#issuecomment-769449638I think it would be good to move this check to a different hook. Or remove it.