paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

polkadot process panicked at 'can not convert float seconds to Duration: value is negative' #817

Closed: stakeworld closed this issue 1 year ago

stakeworld commented 2 years ago

While in the active validator set the process crashed; it did restart and has since been running and validating normally again.

Polkadot LOG
May 29 13:48:14 00.stakeworld.nl polkadot[6173]: Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/7737e0b5c4103216d6fd8cf94>
May 29 13:48:14 00.stakeworld.nl polkadot[6173]: This is a bug. Please report it at:
May 29 13:48:14 00.stakeworld.nl polkadot[6173]:         https://github.com/paritytech/polkadot/issues/new

Dmesg output
[May29 13:48] tokio-runtime-w[6345]: segfault at 0 ip 0000000000000000 sp 00007f13ac76eba8 error 14 in polkadot[5621a0f89000+247000]
[  +0.000081] Code: Bad RIP value.

bkchr commented 2 years ago

Can you please show the entire stack trace? Or more logs around the error?

stakeworld commented 2 years ago

I am sorry, I don't have more info for now. I had a lot of problems on that particular node, so I've moved to other hardware; since then I haven't seen the problem, so it could be hardware related. Will report back if it happens again.

polkalegos commented 2 years ago

I've got the same error running Docker images v0.9.26:

====================

Version: 0.9.26-d8785970175

   0: <unknown>
   1: <unknown>
   2: <unknown>
   3: <unknown>
   4: <unknown>
   5: <unknown>
   6: <unknown>
   7: <unknown>
   8: <unknown>
   9: <unknown>
  10: <unknown>
  11: <unknown>
  12: <unknown>
  13: <unknown>
  14: <unknown>
  15: start_thread
  16: clone

Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/time.rs:782

This is a bug. Please report it at:

    https://github.com/paritytech/polkadot/issues/new

bkchr commented 2 years ago

@polkalegos does this happen multiple times?

polkalegos commented 2 years ago

So far just once, @bkchr

ggwpez commented 2 years ago

Very interesting... the std docs describe this error as only coming from try_from_secs_f* or from_secs_f*.
For what it's worth, I did a grep -Rni vendor/ -e '.*from_secs_.*' on the Polkadot vendor/ directory after a cargo vendor.
But all of these hits are in code that isn't live at runtime :thinking:
So no idea where this is being called from, not even in a dependency.

vendor/sc-sysinfo/src/lib.rs:70:    pub fn from_secs_f32(secs: f32) -> Self {
vendor/sc-sysinfo/src/lib.rs:71:        Self::MaxDuration(Duration::from_secs_f32(secs))
vendor/frame-benchmarking-cli/src/machine/mod.rs:144:       let verify_limit = ExecutionLimit::from_secs_f32(self.verify_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:145:       let disk_limit = ExecutionLimit::from_secs_f32(self.disk_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:146:       let hash_limit = ExecutionLimit::from_secs_f32(self.hash_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:147:       let memory_limit = ExecutionLimit::from_secs_f32(self.memory_duration);

koute commented 2 years ago

@polkalegos Just for reference, can you tell me which exact Docker image you were running?

polkalegos commented 2 years ago

@koute latest tag version 0.9.26

polkalegos commented 2 years ago

@koute happening again on v0.9.29:

Version: 0.9.29-94078b44fb6

   0: sp_panic_handler::set::{{closure}}
   1: std::panicking::rust_panic_with_hook
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:702:17
   2: std::panicking::begin_panic_handler::{{closure}}
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:588:13
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/sys_common/backtrace.rs:138:18
   4: rust_begin_unwind
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   5: core::panicking::panic_fmt
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   6: core::panicking::panic_display
   7: <sc_finality_grandpa::communication::gossip::GossipValidator<Block> as sc_network_gossip::validator::Validator<Block>>::message_allowed::{{closure}}
   8: <sc_network_gossip::bridge::GossipEngine<B> as core::future::future::Future>::poll
   9: <sc_finality_grandpa::communication::NetworkBridge<B,N> as core::future::future::Future>::poll
  10: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
  11: tokio::runtime::task::raw::poll
  12: std::sys_common::backtrace::__rust_begin_short_backtrace
  13: core::ops::function::FnOnce::call_once{{vtable.shim}}
  14: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872:9
      std::sys::unix::thread::Thread::new::thread_start
             at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/sys/unix/thread.rs:108:17
  15: start_thread
  16: clone

Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/time.rs:782

This is a bug. Please report it at:

    https://github.com/paritytech/polkadot/issues/new

koute commented 2 years ago

@polkalegos Thank you for the backtrace!

I'll take a look at this.

arkpar commented 2 years ago

I've run into a similar issue once. The problem was that Instant counters are not consistent across threads, i.e. creating an Instant with Instant::now and then calling elapsed on it from another thread may panic.

koute commented 2 years ago

I've run into a similar issue once. The problem was that Instant counters are not consistent across threads, i.e. creating an Instant with Instant::now and then calling elapsed on it from another thread may panic.

This is a good point, however AFAIK since Rust 1.60 it's supposed to saturate to zero and not panic anymore. elapsed subtracts two Instants, and that calls duration_since, which does an unwrap_or_default.
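
For reference, a minimal self-contained sketch (an illustration, not code from Substrate or Polkadot) of the pattern being described: an Instant created on one thread and read from another, plus checked_duration_since for detecting a monotonicity violation explicitly instead of clamping it:

use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // `Instant` is `Copy + Send`, so it can freely cross thread boundaries.
    let start = Instant::now();

    let handle = thread::spawn(move || {
        // Since Rust 1.60, `elapsed` (via `duration_since`) saturates to zero
        // instead of panicking if the clock appears to have gone backwards.
        let elapsed: Duration = start.elapsed();

        // `checked_duration_since` surfaces a monotonicity violation as `None`
        // rather than silently clamping it to zero.
        match Instant::now().checked_duration_since(start) {
            Some(d) => println!("monotonic: {:?}", d),
            None => println!("clock went backwards"),
        }

        elapsed
    });

    println!("elapsed as seen from the other thread: {:?}", handle.join().unwrap());
}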

Anyway, looking at the code, here's where the error's defined:

pub struct FromFloatSecsError {
    kind: FromFloatSecsErrorKind,
}

impl FromFloatSecsError {
    const fn description(&self) -> &'static str {
        match self.kind {
            FromFloatSecsErrorKind::Negative => {
                "can not convert float seconds to Duration: value is negative"
            }
            FromFloatSecsErrorKind::OverflowOrNan => {
                "can not convert float seconds to Duration: value is either too big or NaN"
            }
        }
    }
}

So we're getting hit by FromFloatSecsErrorKind::Negative.

The only place it's used is in this macro:

macro_rules! try_from_secs {
    (
        secs = $secs: expr,
        mantissa_bits = $mant_bits: literal,
        exponent_bits = $exp_bits: literal,
        offset = $offset: literal,
        bits_ty = $bits_ty:ty,
        double_ty = $double_ty:ty,
    ) => {{
        // ...
        if $secs.is_sign_negative() {
            return Err(FromFloatSecsError { kind: FromFloatSecsErrorKind::Negative });
        }

So the $secs passed here must be negative. This macro is used in these places:

impl Duration {
    // ...
    pub const fn try_from_secs_f32(secs: f32) -> Result<Duration, FromFloatSecsError> {
        try_from_secs!(
            secs = secs,
            mantissa_bits = 23,
            exponent_bits = 8,
            offset = 41,
            bits_ty = u32,
            double_ty = u64,
        )
    }
    // ...
    pub const fn try_from_secs_f64(secs: f64) -> Result<Duration, FromFloatSecsError> {
        try_from_secs!(
            secs = secs,
            mantissa_bits = 52,
            exponent_bits = 11,
            offset = 44,
            bits_ty = u64,
            double_ty = u128,
        )
    }
}
impl Duration {
    // ...
    pub const fn from_secs_f64(secs: f64) -> Duration {
        match Duration::try_from_secs_f64(secs) {
            Ok(v) => v,
            Err(e) => panic!("{}", e.description()),
        }
    }
    // ...
    pub const fn from_secs_f32(secs: f32) -> Duration {
        match Duration::try_from_secs_f32(secs) {
            Ok(v) => v,
            Err(e) => panic!("{}", e.description()),
        }
    }
    // ...
}
impl Duration {
    // ...
    pub const fn mul_f64(self, rhs: f64) -> Duration {
        Duration::from_secs_f64(rhs * self.as_secs_f64())
    }
    // ...
    pub const fn mul_f32(self, rhs: f32) -> Duration {
        Duration::from_secs_f32(rhs * self.as_secs_f32())
    }
    // ...
    pub const fn div_f64(self, rhs: f64) -> Duration {
        Duration::from_secs_f64(self.as_secs_f64() / rhs)
    }
    // ...
    pub const fn div_f32(self, rhs: f32) -> Duration {
        Duration::from_secs_f32(self.as_secs_f32() / rhs)
    }
}

So basically anywhere we create a Duration from an f32/f64, or multiply or divide one by an f32/f64, the value must not be negative, or we'll panic.
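
To make that concrete, here is a small stand-alone example (assuming a recent stable toolchain; try_from_secs_f32 needs Rust 1.66+, and the exact panic wording can vary between Rust versions):

use std::panic;
use std::time::Duration;

fn main() {
    // A negative input makes the panicking constructor abort with the message
    // from the report ("can not convert float seconds to Duration: ...").
    // The default panic hook will still print the message to stderr here.
    assert!(panic::catch_unwind(|| Duration::from_secs_f32(-0.1)).is_err());

    // `mul_f32` with a negative factor panics too, because it is built on top
    // of `from_secs_f32`.
    assert!(panic::catch_unwind(|| Duration::from_secs(1).mul_f32(-1.0)).is_err());

    // The fallible variant reports the problem as an `Err` instead of panicking.
    assert!(Duration::try_from_secs_f32(-0.1).is_err());
}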

Now let's look at GossipValidator::message_allowed in substrate/client/finality-grandpa/src/communication/gossip.rs. The backtrace points into a closure, so the issue is probably somewhere in here:

Box::new(move |who, intent, topic, mut data| {
    if let MessageIntent::PeriodicRebroadcast = intent {
        return do_rebroadcast
    }

    let peer = match inner.peers.peer(who) {
        None => return false,
        Some(x) => x,
    };

    // if the topic is not something we're keeping at the moment,
    // do not send.
    let (maybe_round, set_id) = match inner.live_topics.topic_info(topic) {
        None => return false,
        Some(x) => x,
    };

    if let MessageIntent::Broadcast = intent {
        if maybe_round.is_some() {
            if !inner.round_message_allowed(who) {
                // early return if the vote message isn't allowed at this stage.
                return false
            }
        } else if !inner.global_message_allowed(who) {
            // early return if the global message isn't allowed at this stage.
            return false
        }
    }

    // if the topic is not something the peer accepts, discard.
    if let Some(round) = maybe_round {
        return peer.view.consider_vote(round, set_id) == Consider::Accept
    }

    // global message.
    let local_view = match inner.local_view {
        Some(ref v) => v,
        None => return false, // cannot evaluate until we have a local view.
    };

    match GossipMessage::<Block>::decode(&mut data) {
        Err(_) => false,
        Ok(GossipMessage::Commit(full)) => {
            // we only broadcast commit messages if they're for the same
            // set the peer is in and if the commit is better than the
            // last received by peer, additionally we make sure to only
            // broadcast our best commit.
            peer.view.consider_global(set_id, full.message.target_number) ==
                Consider::Accept && Some(&full.message.target_number) ==
                local_view.last_commit_height()
        },
        Ok(GossipMessage::Neighbor(_)) => false,
        Ok(GossipMessage::CatchUpRequest(_)) => false,
        Ok(GossipMessage::CatchUp(_)) => false,
        Ok(GossipMessage::Vote(_)) => false, // should not be the case.
    }
})

Let's look at round_message_allowed and global_message_allowed, which, as far as I can see, are the only places where the affected functions are used:

fn round_message_allowed(&self, who: &PeerId) -> bool {
    let round_duration = self.config.gossip_duration * ROUND_DURATION;
    let round_elapsed = match self.local_view {
        Some(ref local_view) => local_view.round_start.elapsed(),
        None => return false,
    };

    if round_elapsed < round_duration.mul_f32(PROPAGATION_SOME) {
        self.peers.first_stage_peers.contains(who)
    } else if round_elapsed < round_duration.mul_f32(PROPAGATION_ALL) {
        self.peers.first_stage_peers.contains(who) ||
            self.peers.second_stage_peers.contains(who)
    } else {
        self.peers.peer(who).map(|info| !info.roles.is_light()).unwrap_or(false)
    }
}
// ...
fn global_message_allowed(&self, who: &PeerId) -> bool {
    let round_duration = self.config.gossip_duration * ROUND_DURATION;
    let round_elapsed = match self.local_view {
        Some(ref local_view) => local_view.round_start.elapsed(),
        None => return false,
    };

    if round_elapsed < round_duration.mul_f32(PROPAGATION_ALL) {
        self.peers.first_stage_peers.contains(who) ||
            self.peers.second_stage_peers.contains(who) ||
            self.peers.lucky_light_peers.contains(who)
    } else {
        true
    }
}

Both use mul_f32 which can panic with this error message!

But now here's the weird part; let me list out all of the relevant constants and variables:

const ROUND_DURATION: u32 = 5;
const PROPAGATION_SOME: f32 = 1.5;
const PROPAGATION_ALL: f32 = 3.0;

#[derive(Clone)]
pub struct Config {
    // ...
    pub gossip_duration: Duration,
    // ...
}

// In polkadot:
let gossip_duration = if ... { Duration::from_millis(2000) } else { Duration::from_millis(1000) };

// In substrate:
gossip_duration: std::time::Duration::from_millis(333)

Everything here is positive. So this is really weird; unless I'm missing something, this panic should be impossible to trigger, unless the Config structure gets corrupted, there's a hardware issue, or there's a bug in the kernel that clobbers the floating point registers by mistake (since normally the floating point registers are not saved and reloaded on a context switch).
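
To make the arithmetic concrete, here is a quick sketch plugging in the Substrate-side default of 333 ms (an illustration only, not the actual gossip code path):

use std::time::Duration;

const ROUND_DURATION: u32 = 5;
const PROPAGATION_SOME: f32 = 1.5;
const PROPAGATION_ALL: f32 = 3.0;

fn main() {
    // Substrate-side default from the snippet above.
    let gossip_duration = Duration::from_millis(333);

    // 333 ms * 5 = 1.665 s
    let round_duration = gossip_duration * ROUND_DURATION;

    // 1.665 s * 1.5 ≈ 2.4975 s and 1.665 s * 3.0 ≈ 4.995 s.
    // Both factors are positive compile-time constants, so `mul_f32` should
    // never see a negative value here unless something else is corrupted.
    println!("{:?}", round_duration.mul_f32(PROPAGATION_SOME));
    println!("{:?}", round_duration.mul_f32(PROPAGATION_ALL));
}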

@polkalegos Can you give us more details as to the system on which you've seen this problem?

  1. Where exactly are you running this? (AWS? GCP? Hetzner? etc.)
  2. What are the hardware specs of the system?
  3. Which distro are you using?
  4. What's the kernel version on which you are running?
  5. Does the system have ECC memory?
  6. Can you do cat /proc/cpuinfo and paste us the output?
  7. Do you rent the whole bare-metal system or do you just rent a VM? If you rent the whole bare-metal machine can you run edac-util -v (you might have to install this package) to check if there were any ECC errors?

stakeworld commented 2 years ago

The original report was on a Hetzner VPS, which I think is quite underpowered, so it could be related to hardware/memory problems. Since I moved to dedicated hardware I haven't seen the problem repeat.

ggwpez commented 2 years ago

(from the Rust docs)

To work around these bugs and platforms not offering monotonic clocks duration_since, elapsed and sub saturate to zero. In older Rust versions this lead to a panic instead. checked_duration_since can be used to detect and handle situations where monotonicity is violated, or Instants are subtracted in the wrong order.

Not sure what role Docker and the VPS play here. If it runs in some QEMU or whatever, maybe that breaks the monotonic clock somehow.

From looking at the code I came to the same conclusion that only mul_f32 could cause this, but it is called with sane constants, so it should be fine.