Closed tawmaz closed 2 years ago
Testing with api=0, node crashes again without an error in the log within 5000 blocks.
Testing with wsuse=0, issue seems to be resolved and node is able to sync past 100k blocks without issue so far. I have not seen this issue when syncing a node behind a gateway, only when the IP is public or ports are forwarded to the node.
try not compiling in the upnp library? I'm not using it on my atom low memory box. I'm also using the newer berkeley DB (v5?)
The crash appears to be a segmentation fault that occurs while syncing with wsuse=1 in the config file. I reproduced it under gbd but the SSH connection timed out so I was not able to investigate further. Restarting the process to get another repro.
Call stack from the sementation fault:
Thread 71 "pocketcoind" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f26cbf13700 (LWP 26241)]
SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}::operator()() const (this=0x7f26cbf12670) at ./websocket/ws.h:227
227 ./websocket/ws.h: No such file or directory.
(gdb) bt
#0 SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}::operator()() const (this=0x7f26cbf12670) at ./websocket/ws.h:227
#1 boost::asio::asio_handler_invoke<SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}>(SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}&, ...) (function=...) at /usr/include/boost/asio/handler_invoke_hook.hpp:69
#2 boost_asio_handler_invoke_helpers::invoke<SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}, {lambda()#1}>(SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}&, {lambda()#1}&) (context=..., function=...)
at /usr/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#3 boost::asio::detail::handler_work<SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}, boost::asio::system_executor, SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}>::complete<{lambda()#1}>({lambda()#1}&, {lambda()#1}&) (
handler=..., function=..., this=<synthetic pointer>) at /usr/include/boost/asio/detail/handler_work.hpp:100
#4 boost::asio::detail::completion_handler<SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (owner=<optimized out>, base=<optimized out>) at /usr/include/boost/asio/detail/completion_handler.hpp:70
#5 0x0000560af0103ee2 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=0x7f27380d1a50,
this=<optimized out>) at /usr/include/boost/asio/detail/scheduler_operation.hpp:40
#6 boost::asio::detail::strand_service::do_complete (owner=0x7f27380d1a50, base=0x7f273828d130, ec=...)
at /usr/include/boost/asio/detail/impl/strand_service.ipp:168
#7 0x0000560af01023a6 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=0x7f27380d1a50,
this=0x7f273828d130) at /usr/include/boost/asio/detail/scheduler_operation.hpp:40
#8 boost::asio::detail::scheduler::do_run_one (ec=..., this_thread=..., lock=..., this=<optimized out>)
at /usr/include/boost/asio/detail/impl/scheduler.ipp:447
#9 boost::asio::detail::scheduler::run (this=0x7f27380d1a50, ec=...) at /usr/include/boost/asio/detail/impl/scheduler.ipp:200
#10 0x0000560af00ecd09 in boost::asio::io_context::run (this=<optimized out>) at /usr/include/boost/asio/impl/io_context.ipp:63
#11 SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::accept_and_run (
this=0x7f26cbf12960) at ./websocket/ws.h:457
#12 SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::start (this=0x7f26cbf12960)
at ./websocket/ws.h:468
#13 StartWS () at init.cpp:1599
#14 0x0000560af00f6396 in std::__invoke_impl<void, void (*)()> (__f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89
#15 std::__invoke<void (*)()> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
#16 std::thread::_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
#17 std::thread::_Invoker<std::tuple<void (*)()> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
#18 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)()> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
#19 0x00007f2741f14de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x00007f2742642609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#21 0x00007f2741c02293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Segmentation fault occurs here: https://github.com/pocketnetteam/pocketnet.core/blob/0.20/src/websocket/ws.h#L227
Crashes because "*self" is null gdb:
(gdb) frame 0
#0 SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}::operator()() const (this=0x7f26cbf12670) at ./websocket/ws.h:227
227 in ./websocket/ws.h
(gdb) print self
$1 = std::shared_ptr<SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection> (use count 327682, weak count 458758) = {get() = 0x0}
(gdb) print *self
Cannot access memory at address 0x0
I am extremely confused by the stack - the call to send messages via a web socket is initiated here: src/validation.cpp:3139
Until we figure out the problem, we can try to disable this functionality with a simple check: if (now() - chainActive.Tip().nTime > 3600s) return;
There is also not enough blocking for the NotifiWSClients method in validation.cpp:3139 and logics in init.cpp:1521 To do this, we need to implement a class with a dedicated thread and a queue of blocks for processing in order to exclude the process of sending notifications from the cs_main stream
send is being called from here: https://github.com/pocketnetteam/pocketnet.core/blob/0.20/src/validation.cpp#L3422 I added some debug tracing to be sure. I built with --enable-debug which leaves on some optimizations so some call paths seem to be left out of the stack.
I will create a task to implement a dedicated thread for sending notifications with mutex protection - #119
From additional debugging I have done on this, the segfault always appears to happen after a socket has been closed. The associated memory is freed at that point, however the object may still be in use in other parts of the code. We need to implement a accessor method, socket state variable, and ref count to prevent object from being freed while it is still in use.
Describe the bug Version 0.20.15 crashes with below error in debug.log while performing a full sync. The last block processed is 79455.
debug.log message:
To Reproduce AWS m5ad.xlarge instance type running Ubuntu 20.04. Quad core AMD Epyc, 16 GB RAM, 16 GB swap, 150GB NVME
Setup command history:
pocketcoin.conf