openbmc / bmcweb

A do everything Redfish, KVM, GUI, and DBus webserver for OpenBMC
Apache License 2.0
154 stars 131 forks source link

bmcweb core dumps on reboot if outstanding event is present #193

Closed geissonator closed 3 years ago

geissonator commented 3 years ago

Describe the bug bmcweb crashes and core dumps if bmcweb is restarted and there is an outstanding event to be sent in the EventServiceManager

Environment This issue has been found on the IBM rainier platform. The issue has been around for a while but as we add more events, we have started to see this issue more.

This is downstream IBM code, which was rebased with upstream master on March 29th, 2021. Commit 4df1bee0bdc9d71043b51872875d3d22b26ab68f. This downstream code is carrying a variety of patches which are in process in getting into upstream (https event notification, and other IBM HMC specific commits).

To Reproduce

  1. With our downstream code, the easiest way to recreate this issue is to create a /var/lib/bmcweb/eventservice_config.json file with an invalid "Destination". This ensure an event is on the queue for an extended time.
  2. Restart bmcweb to pick up the new config
  3. Cause an event, in our downstream code we monitor BootProgress so I just change that property to cause an event.
  4. Restart bmcweb -> Notice that bmcweb will core dump

I have been unable to recreate this with upstream code because it doesn't appear the full event notification path has made it in yet.

Is this a regression No, I don't think this has ever worked.

Additional Data I installed valgrind on our system and gathered some debug of the failure. There seem to be 3 different types of invalid memory access occurring during the destructor phase of bmcweb. They all seem to be centered around the HttpClient object and the destruction of it's instance variables.

The first two issues revolve around the destruction of this instance variable: std::optional<boost::beast::ssl_stream<boost::beast::tcp_stream&>> sslConn; This is downstream only code which has added support for a ssl connection in the event notifier. I'll include the valgrind data as a reference point but it's pretty moot to debug this as it's currently downstream only.

The third issue could in theory be something valid in upstream because it has to do with the destruction of this variable from within HttpClient: boost::asio::steady_timer timer Similar to above, while destroying the EventServiceManager and it's flat_map of Subscription objects, we get an invalid reference to freed memory deep within the boost::asio::detail::deadline_timer_service code.

  1. The boost socket object is trying to clean up in the destructor path and accesses something (most likely a pointer) that has already been freed

    ==1267== Invalid read of size 4
    ==1267==    at 0x23D828: boost::asio::basic_socket<boost::asio::ip::tcp, boost::asio::executor>::close(boost::system::error_code&) (reactive_socket_service_base
    .ipp:108)
    ==1267==    by 0x1FDC37: boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::impl_type::close() (basic
    _stream.hpp:150)
    ==1267==    by 0x1523C3: crow::HttpClient::~HttpClient() (basic_stream.hpp:726)
    ==1267==    by 0x1F4477: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:158)
    ==1267==    by 0x1DB4BB: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (shared_ptr_base.h:733)
    ==1267==    by 0x3077DB: std::_Sp_counted_ptr_inplace<redfish::Subscription, std::allocator<redfish::Subscription>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (s
    hared_ptr_base.h:1183)
    ==1267==    by 0x1F4477: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:158)
    ==1267==    by 0x1DB4BB: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (shared_ptr_base.h:733)
    ==1267==    by 0x3FA2EF: UnknownInlinedFun (shared_ptr_base.h:1183)
    ==1267==    by 0x3FA2EF: UnknownInlinedFun (shared_ptr.h:121)
    ==1267==    by 0x3FA2EF: boost::container::dtl::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<redfish::S
    ubscription> >::~pair() [clone .isra.0] (pair.hpp:218)
    ==1267==    by 0x3FA31F: UnknownInlinedFun (allocator_traits.hpp:394)
    ==1267==    by 0x3FA31F: UnknownInlinedFun (allocator_traits.hpp:322)
    ==1267==    by 0x3FA31F: UnknownInlinedFun (copy_move_algo.hpp:987)
    ==1267==    by 0x3FA31F: boost::container::vector<boost::container::dtl::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, s
    td::shared_ptr<redfish::Subscription> >, boost::container::new_allocator<boost::container::dtl::pair<std::__cxx11::basic_string<char, std::char_traits<char>, st
    d::allocator<char> >, std::shared_ptr<redfish::Subscription> > >, void>::~vector() [clone .isra.0] (vector.hpp:1099)
    ==1267==    by 0x153F87: redfish::EventServiceManager::~EventServiceManager() (flat_tree.hpp:475)
    ==1267==    by 0x4E29823: __run_exit_handlers (exit.c:108)
    ==1267==  Address 0x56d36dc is 20 bytes inside a block of size 36 free'd
    ==1267==    at 0x4819944: operator delete(void*, unsigned int) (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
    ==1267==  Block was alloc'd at
    ==1267==    at 0x4818228: operator new(unsigned int) (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
  2. Similar path as above except there is a mutex access in this case:

    ==1267== Invalid read of size 4
    ==1267==    at 0x4DD3FCC: pthread_mutex_lock (pthread_mutex_lock.c:67)
    ==1267==    by 0x12ED37: boost::asio::detail::conditionally_enabled_mutex::scoped_lock::scoped_lock(boost::asio::detail::conditionally_enabled_mutex&) (posix_mu
    tex.hpp:52)
    ==1267==    by 0x134E37: boost::asio::detail::epoll_reactor::deregister_descriptor(int, boost::asio::detail::epoll_reactor::descriptor_state*&, bool) (epoll_rea
    ctor.ipp:352)
    ==1267==    by 0x23D82F: boost::asio::basic_socket<boost::asio::ip::tcp, boost::asio::executor>::close(boost::system::error_code&) (reactive_socket_service_base
    .ipp:108)
    ==1267==    by 0x1FDC37: boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::impl_type::close() (basic
    _stream.hpp:150)
    ==1267==    by 0x1523C3: crow::HttpClient::~HttpClient() (basic_stream.hpp:726)
    ==1267==    by 0x1F4477: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:158)
    ==1267==    by 0x1DB4BB: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (shared_ptr_base.h:733)
    ==1267==    by 0x3077DB: std::_Sp_counted_ptr_inplace<redfish::Subscription, std::allocator<redfish::Subscription>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (s
    hared_ptr_base.h:1183)
    ==1267==    by 0x1F4477: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:158)
    ==1267==    by 0x1DB4BB: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (shared_ptr_base.h:733)
    ==1267==    by 0x3FA2EF: UnknownInlinedFun (shared_ptr_base.h:1183)
    ==1267==    by 0x3FA2EF: UnknownInlinedFun (shared_ptr.h:121)
    ==1267==    by 0x3FA2EF: boost::container::dtl::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<redfish::S
    ubscription> >::~pair() [clone .isra.0] (pair.hpp:218)
    ==1267==  Address 0x57d17a4 is 36 bytes inside a block of size 92 free'd
    ==1267==    at 0x4819944: operator delete(void*, unsigned int) (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
    ==1267==  Block was alloc'd at
    ==1267==    at 0x4818228: operator new(unsigned int) (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
  3. Invalid memory access in boost timer implementation in shutdown/cancel path.

==1267== Invalid read of size 4
==1267==    at 0x2021EC: boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_clock, boost::asio::wait_t
raits<std::chrono::_V2::steady_clock> > >::cancel(boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V2::steady_c
lock, boost::asio::wait_traits<std::chrono::_V2::steady_clock> > >::implementation_type&, boost::system::error_code&) (epoll_reactor.hpp:66)
==1267==    by 0x3F55BF: UnknownInlinedFun (deadline_timer_service.hpp:98)
==1267==    by 0x3F55BF: boost::asio::detail::io_object_impl<boost::asio::detail::deadline_timer_service<boost::asio::detail::chrono_time_traits<std::chrono::_V
2::steady_clock, boost::asio::wait_traits<std::chrono::_V2::steady_clock> > >, boost::asio::executor>::~io_object_impl() [clone .isra.0] (io_object_impl.hpp:90)
==1267==    by 0x1523AB: crow::HttpClient::~HttpClient() (basic_waitable_timer.hpp:379)
==1267==    by 0x1F4477: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:158)
==1267==    by 0x1DB4BB: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (shared_ptr_base.h:733)
==1267==    by 0x3077DB: std::_Sp_counted_ptr_inplace<redfish::Subscription, std::allocator<redfish::Subscription>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (s
hared_ptr_base.h:1183)
==1267==    by 0x1F4477: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:158)
==1267==    by 0x1DB4BB: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (shared_ptr_base.h:733)
==1267==    by 0x3FA2EF: UnknownInlinedFun (shared_ptr_base.h:1183)
==1267==    by 0x3FA2EF: UnknownInlinedFun (shared_ptr.h:121)
==1267==    by 0x3FA2EF: boost::container::dtl::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<redfish::S
ubscription> >::~pair() [clone .isra.0] (pair.hpp:218)
==1267==    by 0x3FA31F: UnknownInlinedFun (allocator_traits.hpp:394)
==1267==    by 0x3FA31F: UnknownInlinedFun (allocator_traits.hpp:322)
==1267==    by 0x3FA31F: UnknownInlinedFun (copy_move_algo.hpp:987)
==1267==    by 0x3FA31F: boost::container::vector<boost::container::dtl::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, s
td::shared_ptr<redfish::Subscription> >, boost::container::new_allocator<boost::container::dtl::pair<std::__cxx11::basic_string<char, std::char_traits<char>, st
d::allocator<char> >, std::shared_ptr<redfish::Subscription> > >, void>::~vector() [clone .isra.0] (vector.hpp:1099)
==1267==    by 0x153F87: redfish::EventServiceManager::~EventServiceManager() (flat_tree.hpp:475)
==1267==    by 0x4E29823: __run_exit_handlers (exit.c:108)
==1267==  Address 0x53f81d4 is 20 bytes inside a block of size 120 free'd
==1267==    at 0x4819944: operator delete(void*, unsigned int) (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==1267==  Block was alloc'd at
==1267==    at 0x4818228: operator new(unsigned int) (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)

I tried a few ideas from Ed in discord which were to use a weak pointer in the timer path and to chack for more specific abort error codes but didn't have luck with either. We probably wont be able to do much with this bug until we get more of the event code into upstream but I wanted to at least get down what we've been seeing and what debug has been done so far.

geissonator commented 3 years ago

As this can only really be recreated with downstream code currently, I'm going to close this issue. It's at least here for reference. If we can get a recreate with upstream code I'll reopen.