prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

MCE (Machine Check Exception) telemetry #986

Closed prateek closed 5 years ago

prateek commented 6 years ago

The EDAC subsystem is unable to detect uncorrectable memory errors when an MCE occurs [1]. To detect such issues, people have relied on mcelog [2], e.g. the collectd plugin for mcelog [3].

Where do you think mcelog integration into the Prometheus world belongs: as a plugin for node_exporter, a separate exporter altogether, or something else?

SuperQ commented 6 years ago

According to my research, mcelog is considered obsolete [0]. It has been replaced by rasdaemon. We will have to look into how rasdaemon actually works to figure out a good solution.

SuperQ commented 6 years ago

According to the release announcement, rasdaemon appears to gather its data from /sys/kernel/debug/tracing/per_cpu/cpu*/trace_pipe_raw. This path is not available to non-root users.

The best option here is probably to write a textfile metrics generator.

discordianfish commented 6 years ago

Yes, nothing we can add here. I'd argue the best option is adding prometheus metrics to rasdaemon :)

SuperQ commented 6 years ago

Anyone want to write a Go version of rasdaemon? :wink:

gebi commented 5 years ago

JFTR: rasdaemon is built with sqlite support, at least on Debian, which means you can do things like this (crude hack):

$ sqlite3 /var/lib/rasdaemon/ras-mc_event.db 'select count(*) from mce_record;'
16

The output shows that 16 mce_record errors have been recorded.

$ sqlite3 /var/lib/rasdaemon/ras-mc_event.db
sqlite> .schema
CREATE TABLE mc_event (id INTEGER PRIMARY KEY, timestamp TEXT, err_count INTEGER, err_type TEXT, err_msg TEXT, label TEXT, mc INTEGER, top_layer INTEGER, middle_layer INTEGER, lower_layer INTEGER, address INTEGER, grain INTEGER, syndrome INTEGER, driver_detail TEXT);
CREATE TABLE aer_event (id INTEGER PRIMARY KEY, timestamp TEXT, err_type TEXT, err_msg TEXT);
CREATE TABLE extlog_event (id INTEGER PRIMARY KEY, timestamp TEXT, etype INTEGER, error_count INTEGER, severity INTEGER, address INTEGER, fru_id BLOB, fru_text TEXT, cper_data BLOB);
CREATE TABLE mce_record (id INTEGER PRIMARY KEY, timestamp TEXT, mcgcap INTEGER, mcgstatus INTEGER, status INTEGER, addr INTEGER, misc INTEGER, ip INTEGER, tsc INTEGER, walltime INTEGER, cpu INTEGER, cpuid INTEGER, apicid INTEGER, socketid INTEGER, cs INTEGER, bank INTEGER, cpuvendor INTEGER, bank_name TEXT, error_msg TEXT, mcgstatus_msg TEXT, mcistatus_msg TEXT, mcastatus_msg TEXT, user_action TEXT, mc_location TEXT);

Though I have no idea whether this schema is anywhere near being considered stable.

discordianfish commented 5 years ago

@gebi Good find, people could turn this into a textfile collector script.
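
A minimal sketch of what such a textfile collector script could look like, assuming the database path and table names shown earlier in this thread. The metric name `node_rasdaemon_events` and the output path are hypothetical, and since the schema may not be stable, the script only queries tables that actually exist:

```python
#!/usr/bin/env python3
"""Sketch: export rasdaemon row counts for the node_exporter textfile collector."""
import os
import sqlite3
import tempfile

# Path taken from the sqlite3 commands in this thread.
DB_PATH = "/var/lib/rasdaemon/ras-mc_event.db"
# Assumption: must match node_exporter's --collector.textfile.directory.
OUT_PATH = "/var/lib/node_exporter/textfile/rasdaemon.prom"
# Table names taken from the .schema output in this thread; they may change.
TABLES = ("mc_event", "aer_event", "extlog_event", "mce_record")


def collect(db_path):
    """Return {table: row_count} for every known table present in the database."""
    # Open read-only so the script cannot modify rasdaemon's data.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        present = {row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")}
        return {t: conn.execute(f"SELECT count(*) FROM {t}").fetchone()[0]
                for t in TABLES if t in present}
    finally:
        conn.close()


def render(counts):
    """Format the counts in the Prometheus text exposition format."""
    name = "node_rasdaemon_events"  # hypothetical metric name
    lines = [f"# HELP {name} Rows recorded by rasdaemon, per table.",
             f"# TYPE {name} gauge"]
    lines += [f'{name}{{table="{t}"}} {n}' for t, n in sorted(counts.items())]
    return "\n".join(lines) + "\n"


def write_atomically(text, out_path):
    """Write to a temp file in the target directory, then rename it into place,
    so node_exporter never reads a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(out_path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.chmod(tmp, 0o644)
    os.replace(tmp, out_path)


def main():
    write_atomically(render(collect(DB_PATH)), OUT_PATH)
```

Calling `main()` from a cron job or systemd timer would refresh the file periodically. The value is exposed as a gauge rather than a counter because the row count could drop if the database is pruned or recreated.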

gebi commented 5 years ago

@discordianfish We are just using a single sh line with the Telegraf inputs.exec plugin and the influx data format.

(Sorry, I don't have the exact example here, but it is something like this:)

[[inputs.exec]]
    commands = ["sh -c 'cmd_=catdoc; echo -n process,type=xapian,name=$cmd_ count=$(pgrep -c -x $cmd_)i'"]
    data_format = "influx"
    timeout = "5s"

This gives the Telegraf metric output: process,host=myhost,name=catdoc,type=xapian count=0i 1551228538000000000

And it produces the following Prometheus metric output: process_count{host="myhost",name="catdoc",type="xapian"} 0
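
The mapping between the two lines above (measurement and field name joined with an underscore, tags turned into labels, the integer suffix and timestamp dropped) can be illustrated with a short sketch. This is not Telegraf's actual implementation; `influx_to_prometheus` is a hypothetical helper that handles only simple records without escaped characters or string fields:

```python
def influx_to_prometheus(line):
    """Convert one simple line-protocol record into Prometheus text-format samples.

    Simplified on purpose: no escaped spaces or commas, no string fields.
    """
    parts = line.split(" ")
    head, fields = parts[0], parts[1]  # an optional trailing timestamp is ignored
    measurement, *tags = head.split(",")
    # Each tag "key=value" becomes a label key="value".
    labels = ",".join('{}="{}"'.format(*t.split("=", 1)) for t in sorted(tags))
    samples = []
    for field in fields.split(","):
        key, value = field.split("=", 1)
        value = value.rstrip("i")  # a trailing "i" marks an integer in line protocol
        samples.append(f"{measurement}_{key}{{{labels}}} {value}")
    return samples


record = "process,host=myhost,name=catdoc,type=xapian count=0i 1551228538000000000"
print(influx_to_prometheus(record)[0])
# process_count{host="myhost",name="catdoc",type="xapian"} 0
```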

mlausch1963 commented 4 years ago

I've written a small Go program that serves the contents of rasdaemon's sqlite3 database as Prometheus metrics. An alpha implementation is available at https://git.bofh.at/mla/rasexporter