pocketnetteam / pocketnet.core

Decentralized social network based on the blockchain
https://pocketnet.app
Apache License 2.0
115 stars 28 forks source link

0.20.15 Out Of Memory Crash with version 0.20.15 #108

Closed tawmaz closed 2 years ago

tawmaz commented 2 years ago

Describe the bug Version 0.20.15 crashes with below error in debug.log while performing a full sync. The last block processed is 79455.

debug.log message:

2021-12-13T18:42:52Z Error: CChainState::NotifyWSClients (1) - bad_weak_ptr
2021-12-13T18:42:52Z Error: Out of memory. Terminating.
2021-12-13T19:02:39Z 

To Reproduce AWS m5ad.xlarge instance type running Ubuntu 20.04. Quad core AMD Epyc, 16 GB RAM, 16 GB swap, 150GB NVME

Setup command history:

   1 sudo mkfs -t xfs /dev/nvme1n1
    2 sudo mkdir /data
    3  sudo mount /dev/nvme1n1 /data
    4  sudo chmod 777 /data
    5  mkdir /data/.pocketcoin
    6  sudo chmod 777 /data/.pocketcoin
    7  ln -s /data/.pocketcoin /home/ubuntu/.pocketcoin
    8  cd ~/.pocketcoin
    9  sudo apt update
   10  sudo apt upgrade
   11  sudo apt install libboost-all-dev
   12  sudo apt install libzmq3-dev
   13  sudo apt-get install libminiupnpc-dev
   14  sudo apt install mosh
   15  sudo fallocate -l 16G /data/swapfile
   16  sudo chmod 600 /data/swapfile
   17  sudo mkswap /data/swapfile
   18  sudo swapon /data/swapfile
   19  sudo swapon -show
   20  sudo apt install awscli
   21  cd ..
   22  wget https://github.com/pocketnetteam/pocketnet.core/releases/download/v0.20.15/pocketnetcore_0.20.15_linux_x64_setup.deb
   23  sudo dpkg -i pocketnetcore_0.20.15_linux_x64_setup.deb 
   26  mkdir checkpoints
   28  cd checkpoints
   29  wget https://github.com/pocketnetteam/pocketnet.core/releases/download/v0.20.15/main.sqlite3
   30  cd ..
   34  pocketcoin-cli logging +stat
   35  tail -f debug.log

pocketcoin.conf

# Enable public API interfaces
api=1
wsuse=1
daemon=1

# Setting ports
publicrpcport=38081
wsport=8087

# Setting work queue and threads
rpcpublicworkqueue=3000
rpcpublicthreads=30
rpcpostworkqueue=1500
rpcpostthreads=15
tawmaz commented 2 years ago

Testing with api=0, node crashes again without an error in the log within 5000 blocks.

tawmaz commented 2 years ago

Testing with wsuse=0, issue seems to be resolved and node is able to sync past 100k blocks without issue so far. I have not seen this issue when syncing a node behind a gateway, only when the IP is public or ports are forwarded to the node.

the-real-vortex-v commented 2 years ago

try not compiling in the upnp library? I'm not using it on my atom low memory box. I'm also using the newer berkeley DB (v5?)

tawmaz commented 2 years ago

The crash appears to be a segmentation fault that occurs while syncing with wsuse=1 in the config file. I reproduced it under gbd but the SSH connection timed out so I was not able to investigate further. Restarting the process to get another repro.

tawmaz commented 2 years ago

Call stack from the sementation fault:

Thread 71 "pocketcoind" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f26cbf13700 (LWP 26241)]
SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}::operator()() const (this=0x7f26cbf12670) at ./websocket/ws.h:227
227     ./websocket/ws.h: No such file or directory.
(gdb) bt
#0  SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}::operator()() const (this=0x7f26cbf12670) at ./websocket/ws.h:227
#1  boost::asio::asio_handler_invoke<SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}>(SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}&, ...) (function=...) at /usr/include/boost/asio/handler_invoke_hook.hpp:69
#2  boost_asio_handler_invoke_helpers::invoke<SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}, {lambda()#1}>(SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}&, {lambda()#1}&) (context=..., function=...)
    at /usr/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#3  boost::asio::detail::handler_work<SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}, boost::asio::system_executor, SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}>::complete<{lambda()#1}>({lambda()#1}&, {lambda()#1}&) (
    handler=..., function=..., this=<synthetic pointer>) at /usr/include/boost/asio/detail/handler_work.hpp:100
#4  boost::asio::detail::completion_handler<SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (owner=<optimized out>, base=<optimized out>) at /usr/include/boost/asio/detail/completion_handler.hpp:70
#5  0x0000560af0103ee2 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=0x7f27380d1a50, 
    this=<optimized out>) at /usr/include/boost/asio/detail/scheduler_operation.hpp:40
#6  boost::asio::detail::strand_service::do_complete (owner=0x7f27380d1a50, base=0x7f273828d130, ec=...)
    at /usr/include/boost/asio/detail/impl/strand_service.ipp:168
#7  0x0000560af01023a6 in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=0x7f27380d1a50, 
    this=0x7f273828d130) at /usr/include/boost/asio/detail/scheduler_operation.hpp:40
#8  boost::asio::detail::scheduler::do_run_one (ec=..., this_thread=..., lock=..., this=<optimized out>)
    at /usr/include/boost/asio/detail/impl/scheduler.ipp:447
#9  boost::asio::detail::scheduler::run (this=0x7f27380d1a50, ec=...) at /usr/include/boost/asio/detail/impl/scheduler.ipp:200
#10 0x0000560af00ecd09 in boost::asio::io_context::run (this=<optimized out>) at /usr/include/boost/asio/impl/io_context.ipp:63
#11 SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::accept_and_run (
    this=0x7f26cbf12960) at ./websocket/ws.h:457
#12 SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::start (this=0x7f26cbf12960)
    at ./websocket/ws.h:468
#13 StartWS () at init.cpp:1599
#14 0x0000560af00f6396 in std::__invoke_impl<void, void (*)()> (__f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89
#15 std::__invoke<void (*)()> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
#16 std::thread::_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
#17 std::thread::_Invoker<std::tuple<void (*)()> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
#18 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)()> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
#19 0x00007f2741f14de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x00007f2742642609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#21 0x00007f2741c02293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
tawmaz commented 2 years ago

Segmentation fault occurs here: https://github.com/pocketnetteam/pocketnet.core/blob/0.20/src/websocket/ws.h#L227

Crashes because "*self" is null gdb:

(gdb) frame 0
#0  SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection::send_from_queue()::{lambda()#1}::operator()() const (this=0x7f26cbf12670) at ./websocket/ws.h:227
227     in ./websocket/ws.h
(gdb) print self
$1 = std::shared_ptr<SimpleWeb::SocketServerBase<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor> >::Connection> (use count 327682, weak count 458758) = {get() = 0x0}
(gdb) print *self
Cannot access memory at address 0x0
andyoknen commented 2 years ago

I am extremely confused by the stack - the call to send messages via a web socket is initiated here: src/validation.cpp:3139

Until we figure out the problem, we can try to disable this functionality with a simple check: if (now() - chainActive.Tip().nTime > 3600s) return;

There is also not enough blocking for the NotifiWSClients method in validation.cpp:3139 and logics in init.cpp:1521 To do this, we need to implement a class with a dedicated thread and a queue of blocks for processing in order to exclude the process of sending notifications from the cs_main stream

tawmaz commented 2 years ago

send is being called from here: https://github.com/pocketnetteam/pocketnet.core/blob/0.20/src/validation.cpp#L3422 I added some debug tracing to be sure. I built with --enable-debug which leaves on some optimizations so some call paths seem to be left out of the stack.

andyoknen commented 2 years ago

I will create a task to implement a dedicated thread for sending notifications with mutex protection - #119

tawmaz commented 2 years ago

From additional debugging I have done on this, the segfault always appears to happen after a socket has been closed. The associated memory is freed at that point, however the object may still be in use in other parts of the code. We need to implement a accessor method, socket state variable, and ref count to prevent object from being freed while it is still in use.

andyoknen commented 2 years ago

https://github.com/pocketnetteam/pocketnet.core/issues/119