Can you please show the entire stack trace? Or more logs around the error?
I am sorry, I don't have more info for now. I had a lot of problems on that particular node, so I've moved to other hardware; since then I haven't seen the problem, so it could be hardware-related. I will report back if it happens again.
I've got the same error running Docker images v0.9.26:
====================
Version: 0.9.26-d8785970175
0: <unknown>
1: <unknown>
2: <unknown>
3: <unknown>
4: <unknown>
5: <unknown>
6: <unknown>
7: <unknown>
8: <unknown>
9: <unknown>
10: <unknown>
11: <unknown>
12: <unknown>
13: <unknown>
14: <unknown>
15: start_thread
16: clone
Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/time.rs:782
This is a bug. Please report it at:
https://github.com/paritytech/polkadot/issues/new
@polkalegos does this happen multiple times?
So far just once, @bkchr
Very interesting... the std documents this error as only happening in try_from_secs_f* or from_secs_f*.
For what it's worth, I did a grep -Rni vendor/ -e '.*from_secs_.*' on the Polkadot vendor/ directory after a cargo vendor.
But all these cases are in code that isn't live :thinking: So no idea where this is being called, not even in a dependency.
vendor/sc-sysinfo/src/lib.rs:70: pub fn from_secs_f32(secs: f32) -> Self {
vendor/sc-sysinfo/src/lib.rs:71: Self::MaxDuration(Duration::from_secs_f32(secs))
vendor/frame-benchmarking-cli/src/machine/mod.rs:144: let verify_limit = ExecutionLimit::from_secs_f32(self.verify_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:145: let disk_limit = ExecutionLimit::from_secs_f32(self.disk_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:146: let hash_limit = ExecutionLimit::from_secs_f32(self.hash_duration);
vendor/frame-benchmarking-cli/src/machine/mod.rs:147: let memory_limit = ExecutionLimit::from_secs_f32(self.memory_duration);
@polkalegos Just for reference, can you tell me which exact Docker image you were running?
@koute the latest tag, version 0.9.26
@koute happening again on v0.9.29:
Version: 0.9.29-94078b44fb6
0: sp_panic_handler::set::{{closure}}
1: std::panicking::rust_panic_with_hook
at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:702:17
2: std::panicking::begin_panic_handler::{{closure}}
at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:588:13
3: std::sys_common::backtrace::__rust_end_short_backtrace
at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/sys_common/backtrace.rs:138:18
4: rust_begin_unwind
at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
5: core::panicking::panic_fmt
at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
6: core::panicking::panic_display
7: <sc_finality_grandpa::communication::gossip::GossipValidator<Block> as sc_network_gossip::validator::Validator<Block>>::message_allowed::{{closure}}
8: <sc_network_gossip::bridge::GossipEngine<B> as core::future::future::Future>::poll
9: <sc_finality_grandpa::communication::NetworkBridge<B,N> as core::future::future::Future>::poll
10: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
11: tokio::runtime::task::raw::poll
12: std::sys_common::backtrace::__rust_begin_short_backtrace
13: core::ops::function::FnOnce::call_once{{vtable.shim}}
14: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872:9
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872:9
std::sys::unix::thread::Thread::new::thread_start
at rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/sys/unix/thread.rs:108:17
15: start_thread
16: clone
Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/time.rs:782
This is a bug. Please report it at:
https://github.com/paritytech/polkadot/issues/new
@polkalegos Thank you for the backtrace!
I'll take a look at this.
I've run into a similar issue once. The problem was that Instant counters are not consistent across threads, i.e. creating an Instant with Instant::now and then calling elapsed on it in another thread may panic.
This is a good point, however AFAIK since Rust 1.60 it's supposed to saturate to zero and not panic anymore. elapsed subtracts two Instants, and that calls duration_since, which does an unwrap_or_default.
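For reference, a minimal standalone sketch of that saturating behaviour (assuming Rust >= 1.60 and a reasonably well-behaved monotonic clock):

use std::time::{Duration, Instant};

fn main() {
    let later = Instant::now();
    let earlier = Instant::now(); // captured after `later`
    // Since Rust 1.60, subtracting Instants in the "wrong" order saturates
    // to zero instead of panicking; elapsed() goes through the same path.
    assert_eq!(later.duration_since(earlier), Duration::ZERO);
    // checked_duration_since returns None when monotonicity is violated.
    println!("{:?}", later.checked_duration_since(earlier));
}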
Anyway, looking at the code, here's where the error's defined:
pub struct FromFloatSecsError {
    kind: FromFloatSecsErrorKind,
}

impl FromFloatSecsError {
    const fn description(&self) -> &'static str {
        match self.kind {
            FromFloatSecsErrorKind::Negative => {
                "can not convert float seconds to Duration: value is negative"
            }
            FromFloatSecsErrorKind::OverflowOrNan => {
                "can not convert float seconds to Duration: value is either too big or NaN"
            }
        }
    }
}
So we're getting hit by FromFloatSecsErrorKind::Negative. The only place it's used is in this macro:
macro_rules! try_from_secs {
    (
        secs = $secs: expr,
        mantissa_bits = $mant_bits: literal,
        exponent_bits = $exp_bits: literal,
        offset = $offset: literal,
        bits_ty = $bits_ty:ty,
        double_ty = $double_ty:ty,
    ) => {{
        // ...
        if $secs.is_sign_negative() {
            return Err(FromFloatSecsError { kind: FromFloatSecsErrorKind::Negative });
        }
So the $secs passed here must be negative. This macro is used in these places:
impl Duration {
    // ...
    pub const fn try_from_secs_f32(secs: f32) -> Result<Duration, FromFloatSecsError> {
        try_from_secs!(
            secs = secs,
            mantissa_bits = 23,
            exponent_bits = 8,
            offset = 41,
            bits_ty = u32,
            double_ty = u64,
        )
    }

    // ...
    pub const fn try_from_secs_f64(secs: f64) -> Result<Duration, FromFloatSecsError> {
        try_from_secs!(
            secs = secs,
            mantissa_bits = 52,
            exponent_bits = 11,
            offset = 44,
            bits_ty = u64,
            double_ty = u128,
        )
    }
}
impl Duration {
    // ...
    pub const fn from_secs_f64(secs: f64) -> Duration {
        match Duration::try_from_secs_f64(secs) {
            Ok(v) => v,
            Err(e) => panic!("{}", e.description()),
        }
    }

    // ...
    pub const fn from_secs_f32(secs: f32) -> Duration {
        match Duration::try_from_secs_f32(secs) {
            Ok(v) => v,
            Err(e) => panic!("{}", e.description()),
        }
    }

    // ...
}
impl Duration {
    // ...
    pub const fn mul_f64(self, rhs: f64) -> Duration {
        Duration::from_secs_f64(rhs * self.as_secs_f64())
    }

    // ...
    pub const fn mul_f32(self, rhs: f32) -> Duration {
        Duration::from_secs_f32(rhs * self.as_secs_f32())
    }

    // ...
    pub const fn div_f64(self, rhs: f64) -> Duration {
        Duration::from_secs_f64(self.as_secs_f64() / rhs)
    }

    // ...
    pub const fn div_f32(self, rhs: f32) -> Duration {
        Duration::from_secs_f32(self.as_secs_f32() / rhs)
    }
}
So basically anywhere we create a Duration from an f32/f64, or multiply or divide a Duration by an f32/f64, the result must be non-negative, or we'll panic.
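A minimal standalone repro of that panic path (this is just the std behaviour, not Substrate code):

use std::time::Duration;

fn main() {
    let d = Duration::from_secs(5);
    // 5 s * -1.0 = -5.0 s, so from_secs_f32 inside mul_f32 panics with
    // "can not convert float seconds to Duration: value is negative".
    let _ = d.mul_f32(-1.0);
}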
Now let's look at GossipValidator::message_allowed in substrate/client/finality-grandpa/src/communication/gossip.rs. The backtrace has a closure, so the issue's probably somewhere here:
Box::new(move |who, intent, topic, mut data| {
    if let MessageIntent::PeriodicRebroadcast = intent {
        return do_rebroadcast
    }

    let peer = match inner.peers.peer(who) {
        None => return false,
        Some(x) => x,
    };

    // if the topic is not something we're keeping at the moment,
    // do not send.
    let (maybe_round, set_id) = match inner.live_topics.topic_info(topic) {
        None => return false,
        Some(x) => x,
    };

    if let MessageIntent::Broadcast = intent {
        if maybe_round.is_some() {
            if !inner.round_message_allowed(who) {
                // early return if the vote message isn't allowed at this stage.
                return false
            }
        } else if !inner.global_message_allowed(who) {
            // early return if the global message isn't allowed at this stage.
            return false
        }
    }

    // if the topic is not something the peer accepts, discard.
    if let Some(round) = maybe_round {
        return peer.view.consider_vote(round, set_id) == Consider::Accept
    }

    // global message.
    let local_view = match inner.local_view {
        Some(ref v) => v,
        None => return false, // cannot evaluate until we have a local view.
    };

    match GossipMessage::<Block>::decode(&mut data) {
        Err(_) => false,
        Ok(GossipMessage::Commit(full)) => {
            // we only broadcast commit messages if they're for the same
            // set the peer is in and if the commit is better than the
            // last received by peer, additionally we make sure to only
            // broadcast our best commit.
            peer.view.consider_global(set_id, full.message.target_number) ==
                Consider::Accept &&
                Some(&full.message.target_number) == local_view.last_commit_height()
        },
        Ok(GossipMessage::Neighbor(_)) => false,
        Ok(GossipMessage::CatchUpRequest(_)) => false,
        Ok(GossipMessage::CatchUp(_)) => false,
        Ok(GossipMessage::Vote(_)) => false, // should not be the case.
    }
})
Let's look at round_message_allowed and global_message_allowed, which are the only places where the affected functions are used, from what I can see:
fn round_message_allowed(&self, who: &PeerId) -> bool {
    let round_duration = self.config.gossip_duration * ROUND_DURATION;
    let round_elapsed = match self.local_view {
        Some(ref local_view) => local_view.round_start.elapsed(),
        None => return false,
    };

    if round_elapsed < round_duration.mul_f32(PROPAGATION_SOME) {
        self.peers.first_stage_peers.contains(who)
    } else if round_elapsed < round_duration.mul_f32(PROPAGATION_ALL) {
        self.peers.first_stage_peers.contains(who) ||
            self.peers.second_stage_peers.contains(who)
    } else {
        self.peers.peer(who).map(|info| !info.roles.is_light()).unwrap_or(false)
    }
}

// ...

fn global_message_allowed(&self, who: &PeerId) -> bool {
    let round_duration = self.config.gossip_duration * ROUND_DURATION;
    let round_elapsed = match self.local_view {
        Some(ref local_view) => local_view.round_start.elapsed(),
        None => return false,
    };

    if round_elapsed < round_duration.mul_f32(PROPAGATION_ALL) {
        self.peers.first_stage_peers.contains(who) ||
            self.peers.second_stage_peers.contains(who) ||
            self.peers.lucky_light_peers.contains(who)
    } else {
        true
    }
}
Both use mul_f32, which can panic with this error message! But now here's the weird part; let me list out all of the relevant constants and variables:
const ROUND_DURATION: u32 = 5;
const PROPAGATION_SOME: f32 = 1.5;
const PROPAGATION_ALL: f32 = 3.0;

#[derive(Clone)]
pub struct Config {
    // ...
    pub gossip_duration: Duration,
    // ...
}

// In polkadot:
let gossip_duration = if ... { Duration::from_millis(2000) } else { Duration::from_millis(1000) };

// In substrate:
gossip_duration: std::time::Duration::from_millis(333)
Everything here is positive. So this is really weird; unless I'm missing something, this panic should be impossible to trigger, unless the Config structure gets corrupted, there's a hardware issue, or there's a bug in the kernel and it clobbers the floating point registers by mistake (since normally the floating point registers are not saved and reloaded on a context switch).
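For reference, a quick standalone sanity check with the values above (using the 333 ms Substrate value; this is just a sketch, not part of the actual codebase):

use std::time::Duration;

const ROUND_DURATION: u32 = 5;
const PROPAGATION_SOME: f32 = 1.5;
const PROPAGATION_ALL: f32 = 3.0;

fn main() {
    let gossip_duration = Duration::from_millis(333);
    let round_duration = gossip_duration * ROUND_DURATION; // 1.665 s
    // Both factors are positive constants, so these calls cannot hit the
    // negative-seconds panic with these inputs.
    println!("{:?}", round_duration.mul_f32(PROPAGATION_SOME)); // ~2.4975 s
    println!("{:?}", round_duration.mul_f32(PROPAGATION_ALL));  // ~4.995 s
}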
@polkalegos Can you give us more details as to the system on which you've seen this problem?
Could you cat /proc/cpuinfo and paste us the output?
Could you run edac-util -v (you might have to install this package) to check if there were any ECC errors?
The original report was on a Hetzner VPS, which I think is very underperforming, so it could be related to hardware/memory problems. Since I moved to dedicated hardware I haven't seen the problem repeat.
(from the Rust docs)
To work around these bugs and platforms not offering monotonic clocks duration_since, elapsed and sub saturate to zero. In older Rust versions this lead to a panic instead. checked_duration_since can be used to detect and handle situations where monotonicity is violated, or Instants are subtracted in the wrong order.
Not sure what role Docker and the VPS play here. If it runs in some QEMU or whatever, maybe it breaks the monotonic clock somehow.
From looking at the code I came to the same conclusion that only mul_f32 would cause this, but it is called with sane constants, so it should be fine.
While in the active validator set, the process crashed; it restarted and has since been running and validating normally again.
Polkadot log:
May 29 13:48:14 00.stakeworld.nl polkadot[6173]: Thread 'tokio-runtime-worker' panicked at 'can not convert float seconds to Duration: value is negative', /rustc/7737e0b5c4103216d6fd8cf94>
May 29 13:48:14 00.stakeworld.nl polkadot[6173]: This is a bug. Please report it at:
May 29 13:48:14 00.stakeworld.nl polkadot[6173]: https://github.com/paritytech/polkadot/issues/new
dmesg output:
[May29 13:48] tokio-runtime-w[6345]: segfault at 0 ip 0000000000000000 sp 00007f13ac76eba8 error 14 in polkadot[5621a0f89000+247000]
[ +0.000081] Code: Bad RIP value.