mhristache closed this issue 3 years ago
Currently I have no clues. Is it easy for you to reproduce this problem after your server restarts?
No, I've only seen it once and I have no idea how to reproduce it. Is there anything you want me to check if I manage to reproduce it?
Thank you
The issue popped up again on another system, and I can confirm it is caused by a deadlock somewhere in the Prometheus lib. I managed to get a backtrace:
#0 alloc::sync::{{impl}}::drop<prometheus::value::Value<prometheus::atomic64::AtomicU64>> (self=0x7fb515f77a38) at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/sync.rs:1419
#1 core::ptr::drop_in_place<alloc::sync::Arc<prometheus::value::Value<prometheus::atomic64::AtomicU64>>> () at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/ptr/mod.rs:175
#2 core::ptr::drop_in_place<prometheus::counter::GenericCounter<prometheus::atomic64::AtomicU64>> () at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/ptr/mod.rs:175
#3 ebmd::server::handle_ebm_stream_connection (stream=..., spec=..., tx=..., l=..., log_raw_events=false, iias=...) at src/server.rs:124
#4 0x0000560080c2e21c in ebmd::server::start_ebm_stream_instance::{{closure}} () at src/server.rs:73
#5 std::sys_common::backtrace::__rust_begin_short_backtrace<closure-1,()> (f=...) at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/sys_common/backtrace.rs:137
#6 0x0000560080bf1abc in std::thread::{{impl}}::spawn_unchecked::{{closure}}::{{closure}}<closure-1,()> () at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/thread/mod.rs:464
#7 std::panic::{{impl}}::call_once<(),closure-0> (self=..., _args=<optimized out>) at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panic.rs:308
#8 std::panicking::try::do_call<std::panic::AssertUnwindSafe<closure-0>,()> (data=<optimized out>) at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:381
#9 std::panicking::try<(),std::panic::AssertUnwindSafe<closure-0>> (f=...) at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panicking.rs:345
#10 std::panic::catch_unwind<std::panic::AssertUnwindSafe<closure-0>,()> (f=...) at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/panic.rs:382
#11 std::thread::{{impl}}::spawn_unchecked::{{closure}}<closure-1,()> () at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/std/src/thread/mod.rs:463
#12 core::ops::function::FnOnce::call_once<closure-0,()> () at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/core/src/ops/function.rs:227
#13 0x0000560080e3275a in alloc::boxed::{{impl}}::call_once<(),FnOnce<()>> () at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/boxed.rs:1042
#14 alloc::boxed::{{impl}}::call_once<(),alloc::boxed::Box<FnOnce<()>>> () at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/boxed.rs:1042
#15 std::sys::unix::thread::{{impl}}::new::thread_start () at library/std/src/sys/unix/thread.rs:87
#16 0x00007fb5179914f9 in start_thread () from target:/lib64/libpthread.so.0
#17 0x00007fb517fdbfbf in clone () from target:/lib64/libc.so.6
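For context, frame #0 in the backtrace is just Arc's destructor decrementing the atomic strong count of the shared counter value, which is ordinarily a cheap lock-prefixed instruction, not a place a thread can block. A minimal std-only illustration of the same clone/drop pattern (this is a sketch of the mechanism, not the prometheus code itself):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Spawn `n` threads sharing one Arc<AtomicU64>, similar to how prometheus
// shares one Arc<Value<AtomicU64>> per labeled child; return the final count.
fn bump_shared(n: usize) -> u64 {
    let value = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            // Cloning the Arc is a lock-prefixed refcount increment...
            let v = Arc::clone(&value);
            thread::spawn(move || {
                v.fetch_add(1, Ordering::Relaxed);
                // ...and dropping `v` at the end of the closure is the
                // matching atomic decrement, as seen in frame #0.
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    value.load(Ordering::Relaxed)
}

fn main() {
    println!("{}", bump_shared(8)); // prints 8
}
```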
The way I use the lib is to define metrics using register_int_counter_vec! inside lazy_static and bump them inside the threads. I create one thread per connection, but I usually only have ~8 threads. Each thread updates the same labels.
So, I do something like this:
Globally:
use lazy_static::lazy_static;
use prometheus::{register_int_counter_vec, IntCounterVec};

lazy_static! {
    static ref RECORDS: IntCounterVec = register_int_counter_vec!(
        "some_metric",
        "some text",
        &["instance_id"]
    )
    .unwrap();
}
In the thread:
RECORDS
    .with_label_values(&[&iid[..]])
    .inc_by(records.len() as u64);
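Since each thread always hits the same label set, the child counter returned by with_label_values could be looked up once per thread and the handle reused, rather than paying the map lookup on every event. Below is a simplified std-only model of that pattern; `CounterVec` and `with_label` here are hypothetical stand-ins for illustration, not the prometheus API:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for an IntCounterVec: a locked map from the label
// value to a shared counter, roughly what a labeled lookup has to touch.
struct CounterVec {
    children: Mutex<HashMap<String, Arc<AtomicU64>>>,
}

impl CounterVec {
    fn with_label(&self, label: &str) -> Arc<AtomicU64> {
        self.children
            .lock()
            .unwrap()
            .entry(label.to_string())
            .or_insert_with(|| Arc::new(AtomicU64::new(0)))
            .clone()
    }
}

// `threads` workers each bump their own label `incs` times, fetching the
// labeled child once per thread and reusing the handle; returns the total.
fn run(threads: usize, incs: u64) -> u64 {
    let records = Arc::new(CounterVec { children: Mutex::new(HashMap::new()) });
    let handles: Vec<_> = (0..threads)
        .map(|i| {
            let records = Arc::clone(&records);
            thread::spawn(move || {
                // One locked lookup here, then cheap atomic adds in the loop.
                let counter = records.with_label(&format!("instance-{}", i));
                for _ in 0..incs {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let children = records.children.lock().unwrap();
    children.values().map(|c| c.load(Ordering::Relaxed)).sum()
}

fn main() {
    println!("{}", run(8, 1000)); // prints 8000
}
```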
I realized my issue was not related to rust-prometheus after all. The high CPU usage attributed to rust-prometheus was just a symptom of another issue in my code causing an infinite loop. Sorry for the wrong report.
Describe the bug I have a multi-threaded TCP server that receives binary data, parses it, and creates some Prometheus stats.
One of the threads is clogging the CPU (stuck at 100% CPU usage). I did a perf recording and it looks like most of the CPU time is consumed by the Prometheus lib (see below).
The strange part is that no segments are being sent by the clients, yet the CPU usage is still 100%.
Also, I don't see anything if I strace the thread, so it looks stuck.
Any clue what might be wrong here?
Looking at the low-level instructions, the CPU time seems to be spent in instructions prefixed with LOCK:
Maybe a deadlock or an infinite loop somewhere?
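A userspace busy-wait on atomics would match all three symptoms here: 100% CPU, LOCK-prefixed instructions dominating the perf profile, and a silent strace, since the loop never makes a syscall. A bounded std-only sketch of that signature (the timing and loop shape are illustrative assumptions):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Spin on atomics until another thread raises a flag; return the iteration
// count so the demo is observable and terminates.
fn spin_demo(stop_after_ms: u64) -> u64 {
    let done = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&done);
    let spinner = thread::spawn(move || {
        let spins = AtomicU64::new(0);
        loop {
            // On x86 this read-modify-write compiles to a LOCK-prefixed
            // instruction; the loop makes no syscalls, so strace shows
            // nothing while the thread sits at 100% CPU.
            spins.fetch_add(1, Ordering::Relaxed);
            if flag.load(Ordering::Acquire) {
                break;
            }
        }
        spins.into_inner()
    });
    thread::sleep(Duration::from_millis(stop_after_ms));
    done.store(true, Ordering::Release);
    spinner.join().unwrap()
}

fn main() {
    println!("spun {} times", spin_demo(20));
}
```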
To Reproduce Not sure; this is the first time I have seen it.
Additional context I am using quite a large number of labels (6), which can result in ~10k unique metric series.
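The ~10k series figure follows from multiplying the per-label cardinalities. The six counts below are hypothetical, chosen only to show how quickly six labels can multiply out to that many children:

```rust
// The number of unique child metrics is the product of the number of
// distinct values each label can take (hypothetical counts for 6 labels).
fn series_count(cardinalities: &[u64]) -> u64 {
    cardinalities.iter().product()
}

fn main() {
    println!("{}", series_count(&[2, 5, 10, 10, 5, 2])); // prints 10000
}
```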