lightningd / plugins

Community curated plugins for core-lightning
BSD 3-Clause "New" or "Revised" License
269 stars 129 forks source link

backup: sqlite3 check in on_init can cause a hang at startup #209

Closed laanwj closed 3 years ago

laanwj commented 3 years ago

The call to listconfigs inside the db_sync hook can hang indefinitely in some circumstances. See https://github.com/ElementsProject/lightning/issues/4348#issuecomment-769449638

I think it would be good to move this check to a different hook. Or remove it.

cdecker commented 3 years ago

That is indeed correct, db_write is a synchronous hook, i.e., we don't process anything else until the hook returns. This also includes RPC calls such as listconfigs. I'm surprised however that we get stuck on the init call, which is used to tell us that the RPC interface is up and where to find it.

That is unless the listconfigs call creates a new DB transaction, which happens to trigger some changes, which then get reported back to db_write, however the plugin is busy waiting for the listconfigs reply and the db_write hook hangs indefinitely. Would be interesting to see if just moving the listconfigs call to a new thread is sufficient.

laanwj commented 3 years ago

Would be interesting to see if just moving the listconfigs call to a new thread is sufficient.

It would be possible to defer the check, but as I understand the nature of the check seems to be to want to fail quickly in the case of a wrong database kind.

laanwj commented 3 years ago

I agree it's really strange. I thought for a minute that on_init was called from first invocation of the db_sync hook, but it seems it is already a separate hook.

Not entirely sure what is happening here, to be honest. Could it be a race between two calls to the plugin itself? It seems to fail too reliably for that though.

m-schmoock commented 3 years ago

Bet that's why the CI often failed/timeout on the backup plugin lately. ( @gallizoltan ) Someone already some clues? I will try to reproduce / isolate this.

m-schmoock commented 3 years ago

To reproduce one can run test_compact frequently by i.e. py.test --count 100 -x -n 16 --timeout 60 -v backup/test_backup.py -k test_compact there are deadlocking instances, see:

https://gist.github.com/m-schmoock/8ff49188399547c2d6eb5a188a68d6f9

The testcase itself unsurprisingly locks up when trying to get the nodes from the factory:

l1 = node_factory.get_node(options=opts, cleandir=False)
m-schmoock commented 3 years ago

As @laanwj pointed out this indeed 'fixed' when removing the listconfigs sqlite3 check: https://github.com/ElementsProject/lightning/issues/4348#issuecomment-769449638

This is what gdb tells us when I attach to a locked up lightningd instance:

(gdb) bt
#0  0x00007f77d0854437 in poll () from /usr/lib/libc.so.6
#1  0x000055a2795a7ee9 in daemon_poll (fds=0x55a27a36ad48, nfds=28, timeout=-1) at common/daemon.c:94
#2  0x000055a279554794 in io_poll_lightningd (fds=0x55a27a36ad48, nfds=28, timeout=-1) at lightningd/lightningd.c:672
#3  0x000055a2795ff3c9 in io_loop (timers=0x0, expired=0x0) at ccan/ccan/io/poll.c:412
#4  0x000055a27957d1af in plugins_exclusive_loop (plugins=0x55a27a443518) at lightningd/plugin.c:1857
#5  0x000055a27957ed24 in plugin_hook_db_sync (db=0x55a27a35ce38) at lightningd/plugin_hook.c:397
#6  0x000055a279586118 in db_report_changes (db=0x55a27a35ce38, final=0x0, min=0) at wallet/db.c:857
#7  0x000055a279586350 in db_commit_transaction (db=0x55a27a35ce38) at wallet/db.c:927
#8  0x000055a2795392a1 in getrawblockbyheight_callback (
    buf=0x55a27a36efb8 "\n\n{\"jsonrpc\":\"2.0\",\"id\":26,\"result\":{\"blockhash\":\"006c743d6b4d2286957acbba98a208ddd36ab2b757a68e38168e6e4f4fbfc104\",\"block\":\"00000020a7b2ffdc3d430f6ac9c7ca4fa786bd6a5ce5b1e7da61f5d48803fdf6f31b7114ad8"..., toks=0x55a27a3bdc38, idtok=0x55a27a3bdc88, call=0x55a27a38f138) at lightningd/bitcoind.c:445
#9  0x000055a279579044 in plugin_response_handle (plugin=0x55a27a348968, toks=0x55a27a3bdc38, idtok=0x55a27a3bdc88) at lightningd/plugin.c:432
#10 0x000055a279579250 in plugin_read_json_one (plugin=0x55a27a348968, complete=0x7fff4b5aef65, destroyed=0x7fff4b5aef66) at lightningd/plugin.c:538
#11 0x000055a279579416 in plugin_read_json (conn=0x55a27a35c2c8, plugin=0x55a27a348968) at lightningd/plugin.c:583
#12 0x000055a2795fc767 in next_plan (conn=0x55a27a35c2c8, plan=0x55a27a35c2e8) at ccan/ccan/io/io.c:59
#13 0x000055a2795fd2ee in do_plan (conn=0x55a27a35c2c8, plan=0x55a27a35c2e8, idle_on_epipe=false) at ccan/ccan/io/io.c:407
#14 0x000055a2795fd32c in io_ready (conn=0x55a27a35c2c8, pollflags=1) at ccan/ccan/io/io.c:417
#15 0x000055a2795ff4eb in io_loop (timers=0x55a27a30a0a8, expired=0x7fff4b5af0a0) at ccan/ccan/io/poll.c:445
#16 0x000055a27954f64c in io_loop_with_timers (ld=0x55a27a3065b8) at lightningd/io_loop_with_timers.c:24
#17 0x000055a279554fad in main (argc=21, argv=0x7fff4b5af248) at lightningd/lightningd.c:1016
m-schmoock commented 3 years ago

'fixed' by #212 - Close for now