mit-dci / opencbdc-tx

A transaction processor for a hypothetical, general-purpose, central bank digital currency

System hangs during raft node initialization on macOS in situations where initialization should fail #281

Closed maurermi closed 1 month ago

maurermi commented 1 month ago

Affected Branch

We have observed that basic_raft_cluster_failure_test hangs on macOS (seen in the macOS CI and on an M3 Mac running macOS Sonoma). The cause is the following block of code in util/raft/node.cpp:46-63:

        m_raft_instance = m_launcher.init(m_sm,
                                          m_smgr,
                                          m_raft_logger,
                                          m_port,
                                          m_asio_opt,
                                          params,
                                          m_init_opts);

        if(!m_raft_instance) {
            m_log->error("Failed to initialize raft launcher");
            return false;
        }

        m_log->info("Waiting for raft initialization");
        static constexpr auto wait_time = std::chrono::milliseconds(100);
        while(!m_raft_instance->is_initialized()) {
            std::this_thread::sleep_for(wait_time);
        }

On macOS, m_launcher.init() returns a non-null instance even in situations where the raft instance cannot be initialized successfully. The !m_raft_instance check therefore passes, is_initialized() never becomes true, and the waiting loop runs forever. This does not appear to happen on Linux (verified on Ubuntu).

This error originates in the NuRaft codebase, so I propose two potential solutions:

  1. We are currently using NuRaft v1.3.0, whereas NuRaft is currently on version 2.1.0. We should investigate whether this problem has since been fixed upstream and consider upgrading.
  2. We should add a timeout so that we never wait indefinitely for raft initialization. This is likely a wise addition whether or not we upgrade NuRaft.
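
The second option could be sketched roughly as follows. This is a minimal illustration, not the fix that was merged: `wait_with_timeout` is a hypothetical helper that exists in neither opencbdc-tx nor NuRaft, and the timeout and polling interval are left to the caller because the issue does not specify values.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not part of opencbdc-tx or NuRaft).
// Polls `ready` until it returns true or `timeout` has elapsed.
// Returns true if `ready` became true in time, false on timeout.
inline auto wait_with_timeout(const std::function<bool()>& ready,
                              std::chrono::milliseconds timeout,
                              std::chrono::milliseconds poll_interval)
    -> bool {
    // steady_clock is monotonic, so wall-clock adjustments cannot
    // shorten or extend the deadline.
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while(!ready()) {
        if(std::chrono::steady_clock::now() >= deadline) {
            return false;
        }
        std::this_thread::sleep_for(poll_interval);
    }
    return true;
}
```

In the block quoted above, the unbounded `while` loop could then be replaced by a call that passes a lambda wrapping `m_raft_instance->is_initialized()`, logging an error and returning false when the helper reports a timeout. The appropriate timeout is an open choice; it would need to be long enough for a healthy cluster to initialize on slow CI machines.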

Basic Diagnostics

Description

In order to reproduce the issue, follow these steps:

  1. Run the basic_raft_cluster_failure_test on macOS


maurermi commented 1 month ago

Assigning this to @eolesinski

HalosGhost commented 1 month ago

via #290