riebl / vanetza

Open-source implementation of the ETSI C-ITS protocol stack
Other
197 stars 157 forks source link

Crash from manual runtime scheduling of Received CAM messages #181

Open sivasr1 opened 1 year ago

sivasr1 commented 1 year ago

Hello everyone,

While we are trying to reproduce the socktap binary with CAM application for DEMO purpose we always get a crash over time from the function "ManualRuntime::trigger(Clock::time_point tp)" defined in "vanetza/common/manual_runtime.cpp" This crash (mainly on reception side and rarely on transmission side) happens only after successful processing of many received CAM messages.

From the debugging the main reason for this crash seems to be because of the simultaneous calls to function "void TimeTrigger::schedule()" defined in "tools/socktap/time_trigger.cpp"

NOTE :: The crash issue is always reproducible , but crash happens quicker if the rate of transmitted messages is high ( example :: 100 milliseconds/1 CAM message). So basically receiving 10 messages in a second runs about 40-60 minutes without any issues but then this crash happens. Once we reduce the rate of the receival of the messages from 10 to 1 in a second then actually our system runs for multiple hours before having the same crash.

Please refer commands below for more information related to GDB back trace and socktap commands usage.

Core issue::

socktap: vanetza/common/manual_runtime.cpp:62: void vanetza::ManualRuntime::trigger(vanetza::Clock::time_point): Assertion `tp >= m_now' failed. 8772 64 Aborted (core dumped)

On Transmission side::

./socktap -a ca -p static --latitude 40.8810 --longitude 29.3433 --pos_confidence 5 --print-tx-cam --cam-interval 100 -i lo --mac-address 00:00:00:00:00:11

On Reception Side::

./socktap -a ca -p static --latitude 40.8810 --longitude 29.3433 --pos_confidence 5 --print-rx-cam -i lo --mac-address 00:00:00:00:00:22

GDB backtrace::

> (gdb) bt
 #0  0x0000ffff94f789a8 in raise () from /lib/libc.so.6
 #1  0x0000ffff94f65f54 in abort () from /lib/libc.so.6
 #2  0x0000ffff94f72148 in ?? () from /lib/libc.so.6
 #3  0x0000ffff94f721ac in __assert_fail () from /lib/libc.so.6
 #4  0x0000aaaac557a868 in vanetza::ManualRuntime::trigger (this=this@entry=0xffffee3a61e0, tp=...)
     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/c++/9.2.0/bits/locale_facets.h:877
> #5  0x0000aaaac54cee64 in TimeTrigger::update_runtime (this=this@entry=0xffffee3a6160)
>     at tools/socktap/time_trigger.cpp:48
> #6  0x0000aaaac54cef48 in TimeTrigger::schedule (this=0xffffee3a6160) at tools/socktap/time_trigger.cpp:25
> #7  0x0000aaaac54d2098 in std::__invoke_impl<void, void (TimeTrigger::*&)(boost::system::error_code const&), TimeTrigger*&, boost::system::error_code const&> (__f=<optimized out>, __f=<optimized out>, __t=<optimized out>)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/c++/9.2.0/bits/invoke.h:89
> #8  std::__invoke<void (TimeTrigger::*&)(boost::system::error_code const&), TimeTrigger*&, boost::system::error_code const&> (__fn=
>     @0xffffee3a5bd0: (void (TimeTrigger::*)(TimeTrigger * const, const boost::system::error_code &)) 0xaaaac54cf1d0 <TimeTrigger::on_timeout(boost::system::error_code const&)>) at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/c++/9.2.0/bits/invoke.h:95
> #9  std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>::__call<void, boost::system::error_code const&, 0ul, 1ul>(std::tuple<boost::system::error_code const&>&&, std::_Index_tuple<0ul, 1ul>) (__args=..., this=0xffffee3a5bd0)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/c++/9.2.0/functional:400
> #10 std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>::operator()<boost::system::error_code const&, void>(boost::system::error_code const&) (this=0xffffee3a5bd0)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/c++/9.2.0/functional:484
> #11 boost::asio::detail::binder1<std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>, boost::system::error_code>::operator()() (this=0xffffee3a5bd0)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/detail/bind_handler.hpp:171
> #12 boost::asio::asio_handler_invoke<boost::asio::detail::binder1<std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>, boost::system::error_code> >(boost::asio::detail::binder1<std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>, boost::system::error_code>&, ...) (function=...)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/handler_invoke_hook.hpp:88
> #13 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder1<std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>, boost::system::error_code>, std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)> >(boost::asio::detail::binder1<std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>, boost::system::error_code>&, std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>&) (context=..., function=...)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/detail/handler_invoke_helpers.hpp:54
> #14 boost::asio::detail::handler_work<std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>,
>  boost::asio::any_io_executor, void>::complete<boost::asio::detail::binder1<std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>, boost::system::error_code> >(boost::asio::detail::binder1<std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>, boost::system::error_code>&, std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>&) (handler=..., function=..., this=0xffffee3a5c30)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/detail/handler_work.hpp:520
> #15 boost::asio::detail::wait_handler<std::_Bind<void (TimeTrigger::*(TimeTrigger*, std::_Placeholder<1>))(boost::system::error_code const&)>, boost::asio::any_io_executor>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (owner=0xaaaae0940320, base=<optimized out>)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/detail/wait_handler.hpp:76
> #16 0x0000aaaac549b08c in boost::asio::detail::scheduler_operation::complete (bytes_transferred=<optimized out>, ec=..., 
>     owner=0xaaaae0940320, this=<optimized out>)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/detail/scheduler_operation.hpp:40
> #17  boost::asio::detail::scheduler::do_run_one (ec=..., this_thread=..., lock=..., this=0xaaaae0940320)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/detail/impl/scheduler.ipp:492
> #18 boost::asio::detail::scheduler::run (this=0xaaaae0940320, ec=...)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/detail/impl/scheduler.ipp:210
> #19 0x0000aaaac549041c in boost::asio::io_context::run (this=0xffffee3a5ec8)
>     at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/impl/io_context.ipp:63
> #20 main (argc=<optimized out>, argv=<optimized out>) at tools/socktap/main.cpp:205
> (gdb)

Thanks Siva

riebl commented 1 year ago

Hi @sivasr1,

can you please elaborate on your changes to the socktap code? You mention "simultaneous calls" to TimeTrigger::schedule, but originally socktap is a purely single-threaded application. What are the values of tp and m_now when the assertion fails?

sivasr1 commented 1 year ago

Hi Reibl,

Thanks for your reply. we made some changes in link layer part (raw_socket_link.cpp , where a new thread is instantiated which calls function "do_receive") to use quectel service. I have attached a couple of documents with the logs( includes current and previous time stamps at which functions are called , and start of functions ..etc).

Thanks Siva old_Rx_log.txt Crash_Feb03_Rx.txt

riebl commented 1 year ago

Okay, multi-threading explains this behaviour: Vanetza is not a thread-safe library, i.e. you can use Vanetza with threads, but you need to add thread synchronization on your own. In your particular case, a (fatal) race condition occurs because the main thread running socktap (the io_service.run() in main.cpp) and your link layer thread may concurrently update the internal timestamp. Your link layer invokes the indication callback which finally invokes RouterContext::indicate. Thus, the whole RouterContext object and all of its members are prone to concurrent access by two threads – a classic race condition.

Instead of adding mutexes everywhere, I suggest having a look at boost::asio::post. With this feature, you can queue the execution of the indication callback invocation to the main thread where the io_service is executed. Calling boost::asio::post is thread-safe.

sivasr1 commented 1 year ago

Hi Riebl,

Sorry for the late reply , I am a novice to boost::asio , not sure if i understood and modified code correctly, i modified code as attached , which is leading to a crash.

In short modified as below:

// Get io_context here and execute [io.context.post](http://io.context.post/) with do_receive
auto ex = socket_.get_executor();//boost::asio::get_associated_executor()
boost::asio::io_context * io_context = ex.target<boost::asio::io_context>();
io_context->post(std::bind(&RawSocketLink::do_receive , this )); // boost::bind not working //error: 'bind' is not a member of 'boost'

GDB BackTrace::

Thread 1 "socktap" received signal SIGSEGV, Segmentation fault.
boost::asio::io_context::initiate_post::operator()<std::_Bind<void (RawSocketLink::*(RawSocketLink*))()> >(std::_Bind<void (RawSocketLink::*(RawSocketLink*))()>&&, boost::asio::io_context*) const (self=<optimized out>,
    handler=..., this=<optimized out>)
    at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/c++/9.2.0/bits/move.h:74
74    forward(typename std::remove_reference<_Tp>::type& __t) noexcept
(gdb) bt
#0  boost::asio::io_context::initiate_post::operator()<std::_Bind<void (RawSocketLink::*(RawSocketLink*))()> >(std::_Bind<void (RawSocketLink::*(RawSocketLink*))()>&&, boost::asio::io_context*) const (self=<optimized out>,
    handler=..., this=<optimized out>)
    at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/c++/9.2.0/bits/move.h:74
#1  boost::asio::detail::completion_handler_async_result<std::_Bind<void (RawSocketLink::*(RawSocketLink*))()>, void ()>::initiate<boost::asio::io_context::initiate_post, std::_Bind<void (RawSocketLink::*(RawSocketLink*))()>, boost::asio::io_context*>(boost::asio::io_context::initiate_post&&, std::_Bind<void (RawSocketLink::*(RawSocketLink*))()>&&, boost::asio::io_context*&&) (token=..., initiation=...)
    at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/async_result.hpp:482
#2  boost::asio::async_initiate<std::_Bind<void (RawSocketLink::*(RawSocketLink*))()>, void (), boost::asio::io_context::initiate_post, boost::asio::io_context*>(boost::asio::io_context::initiate_post&&, std::_Bind<void (RawSocketLink::*(RawSocketLink*))()>&, boost::asio::io_context*&&) (token=..., initiation=...)
    at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/async_result.hpp:896
#3  boost::asio::io_context::post<std::_Bind<void (RawSocketLink::*(RawSocketLink*))()> >(std::_Bind<void (RawSocketLink::*(RawSocketLink*))()>&&) (handler=..., this=<optimized out>)
    at /opt/fsl-imx-wayland/5.4-zeus/sysroots/aarch64-poky-linux/usr/include/boost/asio/impl/io_context.hpp:206
#4  RawSocketLink::RawSocketLink (this=0xaaaaaacdee90, socket=...)
    at /home/siva/newWorkSpace/vanetsoar/tools/socktap/raw_socket_link_soar.cpp:168
#5  0x0000aaaaaab6f308 in create_link_layer (io_context=..., device=..., name=...)
    at /home/siva/newWorkSpace/vanetsoar/tools/socktap/link_layer.cpp:27
#6  0x0000aaaaaab578e8 in main (argc=18, argv=<optimized out>)
    at /home/siva/newWorkSpace/vanetsoar/tools/socktap/main.cpp:104
(gdb) 

Could you please check if my implementation of boost::asio::post is correct ? or would you suggest any other better alternative ? Thanks for your help.

Thanks Siva

riebl commented 1 year ago

It is hard to "debug" your code with little information. You can fetch the Boos.Asio execution context from socket_.get_executor().context() and pass it to boost::asio::post:

auto& ec = socket_.get_executor().context();
boost::asio::post(ec, std::bind(&RawSocketLink::do_receive, this));

If binding do_receive makes sense depends a lot on the rest of your implementation. In the end, callback_ needs to get invoked by the execution context.