hubblo-org / scaphandre

⚡ Energy consumption metrology agent. Let "scaph" dive and bring back the metrics that will help you make your systems and applications more sustainable !
Apache License 2.0

Random failure in rocky linux based custom container #380

Open GregWhiteyBialas opened 6 months ago

GregWhiteyBialas commented 6 months ago

Bug description

I have built a container with an RPM-based scaphandre installation. I am starting it on bare metal with the `prometheus --qemu` option. In the docker logs I see:

scaphandre::sensors: Sysinfo sees 256
Scaphandre stdout exporter
Sending ⚡ metrics
Measurement step is: 2s

When I try to `curl http://localhost:8080/metrics` I get no output on the console, and in the logs I see:

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/sensors/utils.rs:177:18
scaphandre::exporters::prometheus: Error in show_metrics : PoisonError { .. }
scaphandre::exporters::prometheus: Error details : poisoned lock: another task failed inside

Each subsequent run of curl produces:

scaphandre::exporters::prometheus: Error in show_metrics : PoisonError { .. }
scaphandre::exporters::prometheus: Error details : poisoned lock: another task failed inside

Once in a few runs scaphandre starts properly and I am able to scrape metrics. I have run many tests to determine when it happens (without changing ownership and access rights on /sys/class/powercap, i.e. without running init.sh; after a reboot, to reset ownership of /sys; restarting the docker container; purging docker; running scaphandre with the stdout option; etc.) and found nothing conclusive.

Below is console output where I ran scaphandre a few times before it worked, after a few unsuccessful attempts to start it.

(kolla-ansible) [stack@hpc30 ~]$ docker run -v /sys/class/powercap:/sys/class/powercap -v /proc:/proc -ti --network host -e RUST_BACKTRACE=full kolla/scaphandre:17.1.0  scaphandre stdout -t 5
scaphandre::sensors: Sysinfo sees 256
Scaphandre stdout exporter
Sending ⚡ metrics
Measurement step is: 2s
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/sensors/utils.rs:177:18
stack backtrace:
   0:     0x5576c0d21f41 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hf66164b97344d0a2
   1:     0x5576c0d4a4af - core::fmt::write::hbb74f2248ccd4395
   2:     0x5576c0d1edb1 - std::io::Write::write_fmt::hed9c5edae1eac7b4
   3:     0x5576c0d21d55 - std::sys_common::backtrace::print::hc9a6bb05c1f66b1d
   4:     0x5576c0d231f7 - std::panicking::default_hook::{{closure}}::h617bee45ce760ff9
   5:     0x5576c0d22fe4 - std::panicking::default_hook::hfb5619c23c95dafb
   6:     0x5576c0d236ac - std::panicking::rust_panic_with_hook::h07253f826b957552
   7:     0x5576c0d23561 - std::panicking::begin_panic_handler::{{closure}}::hfde4141a9de96c92
   8:     0x5576c0d22376 - std::sys_common::backtrace::__rust_end_short_backtrace::he15cde744ac23f89
   9:     0x5576c0d232f2 - rust_begin_unwind
  10:     0x5576c06be443 - core::panicking::panic_fmt::h2494779393265ba8
  11:     0x5576c06be4d3 - core::panicking::panic::hfcc79b23445abeb8
  12:     0x5576c0799450 - scaphandre::exporters::MetricGenerator::gen_self_metrics::h280d657f7d304306
  13:     0x5576c07a208b - scaphandre::exporters::MetricGenerator::gen_all_metrics::h63813309d030eccd
  14:     0x5576c07b54a3 - scaphandre::exporters::stdout::StdoutExporter::iterate::h06a8bbbbab974fa2
  15:     0x5576c07b52c8 - <scaphandre::exporters::stdout::StdoutExporter as scaphandre::exporters::Exporter>::run::hd0394d843640f8d2
  16:     0x5576c06d6203 - scaphandre::main::h75d3d0458ba1b902
  17:     0x5576c06cdfd3 - std::sys_common::backtrace::__rust_begin_short_backtrace::hd18dc57ef0d20d7c
  18:     0x5576c06c9ad9 - std::rt::lang_start::{{closure}}::he293a497447ace7d
  19:     0x5576c0d18ef5 - std::rt::lang_start_internal::he62005167fe2938d
  20:     0x5576c06d9c95 - main
  21:     0x7f1a5f2b4eb0 - __libc_start_call_main
  22:     0x7f1a5f2b4f60 - __libc_start_main_alias_1
  23:     0x5576c06bebf5 - _start
  24:                0x0 - <unknown>
(kolla-ansible) [stack@hpc30 ~]$ docker run -v /sys/class/powercap:/sys/class/powercap -v /proc:/proc -ti --network host -e RUST_BACKTRACE=full kolla/scaphandre:17.1.0  scaphandre stdout -t 5
scaphandre::sensors: Sysinfo sees 256
Scaphandre stdout exporter
Sending ⚡ metrics
Measurement step is: 2s
scaphandre::sensors: Not enough records for socket
scaphandre::sensors: Not enough records for socket
Host:   0 W from
        package         core
Top 5 consumers:
Power           PID     Exe
No processes found yet or filter returns no value.
------------------------------------------------------------

Host:   167.52704 W from
        package         core
Socket1 83.300095 W |   0.123677 W

Socket0 85.29291 W |    0.137677 W

Top 5 consumers:
Power           PID     Exe
2.625001 W      295896  "/usr/bin/scaphandre"
0.0029199123 W  10613   ""
0.0029199123 W  10718   ""
0.0029199123 W  9711    ""
0.0029199123 W  4934    ""
------------------------------------------------------------

What is strange: whenever I start scaphandre using the official image, it works just fine.

To Reproduce

Build an image based on Rocky Linux 9.3, install the scaphandre RPM in it, and run it.

Expected behavior

`scaphandre prometheus --qemu` starts properly each time.

Screenshots

n/a

Environment:

Rocky Linux 9.3

 uname -a
Linux hpc30 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Sep 16 09:55:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional context

Why am I building docker images instead of using the official one? I want to add scaphandre support to the OpenStack deployment project kolla-ansible. This effort can be tracked here: https://review.opendev.org/c/openstack/kolla/+/914646/10

bpetit commented 5 days ago

Hi @GregWhiteyBialas,

Could you specify which tag/branch you build from? Thanks