massalabs / massa

The Decentralized and Scaled Blockchain
https://massa.net
5.57k stars 713 forks source link

Bootstrap retry seems to fail because of double lock #3970

Closed Eitu33 closed 1 year ago

Eitu33 commented 1 year ago

Context

This was found testing the new versioning system, after experiencing a desynchronization, node tried to bootstrap again and process terminated with the following error:

thread 'main' panicked at 'Error setting Ctrl-C handler: MultipleHandlers', massa-node/src/main.rs:253:6
Ben-PH commented 1 year ago

Any suggestions to reproduce?

AurelienFT commented 1 year ago

Any suggestions to reproduce?

in consensus-worker check_desync you can just change the condition to true or true after some slot so that it's triggered directly.

Ben-PH commented 1 year ago

I've found what I beleive to be the culprit: the launch function initialises the (non-async) ctrlc handler, and with a de-sync, the launch fn gets called again, and from the initialiser documentation:

/// # Errors
/// Will return an error if another `ctrlc::set_handler()` handler exists or if a
/// system error occurred while setting the handler.
///

I could put together a fix relatively easily, buuuut....

There are quite a few ctrl-c handlers speckled through the codebase. Might be worth discussing having a single, global ctrlc-signal?

    let sig_int_triggered = Arc::new((Mutex::new(false), Condvar::new()));
    let sigint_triggered_clone = Arc::clone(&sig_int_triggered);
    ctrlc::set_handler(move || {
        *sigint_triggered_clone
            .0
            .lock()
            .expect("double-lock on interupt bool in ctrl-c handler") = true;
        sigint_triggered_clone.1.notify_all();
    })
    .expect("Error setting Ctrl-C handler");

Then wherever the code needs to be sig-int aware, we provide that boolean as appropriate.

in the bootstrap code we have this:

            let int_sig = interupted
                .0
                .lock()
                .expect("double-lock() on interupted signal mutex");
            let wake = interupted
                .1
                .wait_timeout(int_sig, bootstrap_config.retry_delay.to_duration())
                .expect("interupt signal mutex poisoned");
            if *wake.0 {
                return Err(BootstrapError::Interupted(
                    "Sig INT during bootstrap retry-wait".to_string(),
                ));
            }

under the hood, the sigint variable gets passed into a dedicated system thread, and is in an event loop. in our case

where we have code like this:

        // interrupt signal listener
        let (tx, rx) = crossbeam_channel::bounded(1);
        let interrupt_signal_listener = tokio::spawn(async move {
            signal::ctrl_c().await.unwrap();
            tx.send(()).unwrap();
        });

        /* ... */
        match rx.try_recv() {
                Ok(_) => {
                    info!("interrupt signal received");
                    break false;
                }
                Err(crossbeam_channel::TryRecvError::Disconnected) => {
                    error!("interrupt_signal_listener disconnected");
                    break false;
                }
                _ => {}
            }

we would instead have

            if !global_ctrlc.0.lock().unwrap().clone() {
                break false;
            }

...though I would like to set things up so that it can take a RwLock instead (currently cannot)

The teams input would be valuable

AurelienFT commented 1 year ago

Personally, I'm having hard time understanding if you can make a PR it would be more comprehensive for me I think. But the priority is for the fix

Ben-PH commented 1 year ago

3971

I'll get to work on the PR to see if we can have a single toggle

Eitu33 commented 1 year ago

I'm having the same issue on first bootstrap with labnet

Eitu33 commented 1 year ago

Still having the issue with https://github.com/massalabs/massa/pull/3971 fix, will be removing the control c handler for now

Ben-PH commented 1 year ago

For context: I've been sick/away for the better part of the last two weeks, and this is my first day back, hence the gap.

This is the sequence of events, and work done today:

  1. It initially got triggered on re-sync: On review of the code, that makes sense, as the function I used is a once-only thing, and didn't notice it was in the loop that doesn't break on the resync.
  2. I reproduce the panic by forcing resync. then move the initialization out of the loop. It didn't panic on resync: That was the measure by which I thought the issue fixed (no more re-initializing)
  3. I pushed the fix (now without the resync being forced)
  4. Thomas reported still having issues: commented out the sigint handling initialization
  5. I start trying to resolve by reverting comment
  6. Can't work out how recreate the bug.

This is what I've tried to do to reproduce the bug

...no luck on any front.