sony / nmos-cpp

An NMOS (Networked Media Open Specifications) Registry and Node in C++ (IS-04, IS-05)
Apache License 2.0
136 stars 79 forks source link

Reader/writer model locking #315

Open garethsb opened 1 year ago

garethsb commented 1 year ago

I hit a crash in nmos-cpp-registry when load testing...

Config:

I launched 400 Nodes with nmos-multi-node on same host as the Registry, along with an NMOS Controller using the Query API WebSocket subscriptions. I thought the Registration API request processing would be throttled by the size of the thread pools for http_listener and pplx, but possibly not (needs more investigation), and ending up hitting the exclusive locker limit of boost::shared_mutex, which results in an uncaught boost::lock_error exception (and the mutex state being incorrect, so just catching the exception isn't going to help).

http://stackoverflow.com/questions/69580941/why-boostshared-mutex-cannot-block-more-than-128-threads

Before we even hit the uncaught boost::lock_error exception, in nmos::send_query_ws_events_thread, there are several non-crash errors caused by the same issue but in places where the exception is caught and logged as an "Implementation error" and causes "HTTP error: An operation was attempted on a nonexistent network connection. [windows:1229]" presumably because a registration request never got a response.

Possible workarounds for the crash itself involve replacing current boost::shared_mutex with an alternative... std::shared_mutex (but that requires C++17), std::shared_timed_mutex (that requires C++14), BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN and (probably) BOOST_THREAD_V2_SHARED_MUTEX (requires rebuild/nonstandard boost config), std::mutex (means no reader locks, so needs further testing to prove no deadlocks before even getting to performance profiling, though this is exactly what we had before the switch to use shared_mutex, and the fact that the crash is happening suggests that the profile of model reads and writes that we're seeing may not actually benefit from shared_mutex)

FWIW, this issue has presumably been present since 2018... https://github.com/sony/nmos-cpp/commit/9bca76ab957796beabebe2448fd3a2fff34e412d

However, after experimenting on the different ways to resolve this, with limited success, I realised I'd only been trying Debug builds of the Registry... all the workarounds I tried performed perfectly in Release builds... but then so does the current shared_mutex approach. So for now, I'm just leaving this issue here as a reminder, if boost::lock_error rears its ugly head again...

(Tangentially relevant: a useful post which explains clearly why it only makes sense to have one upgradable lock at a time... https://stackoverflow.com/questions/59710699/upgrade-shared-lock-to-unique-lock-usage-timing-and-design And the paper which also describes it... https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3568.html#OverviewUpgrade)