ncabatoff / process-exporter

Prometheus exporter that mines /proc to report on selected processes
MIT License

Process falsely reported as being down #214

Closed: dkundo closed this issue 3 years ago

dkundo commented 3 years ago

I'm using process-exporter to monitor a set of daemon processes on thousands of servers. Once in a while the namedprocess_namegroup_num_procs metric drops to 0 even though the process is running. It happens on different servers, but always for the same daemon and never for any other process. Restarting process-exporter fixes it.
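To spot affected servers, the exporter's output can be checked for groups whose namedprocess_namegroup_num_procs is 0. Below is a minimal sketch that parses the metrics text. The group name "smartview" in the sample is hypothetical (the real name depends on the matcher in process.yml), and the helper names are made up for illustration.

```python
import re

# Matches lines such as:
#   namedprocess_namegroup_num_procs{groupname="smartview"} 0
_NUM_PROCS = re.compile(
    r'^namedprocess_namegroup_num_procs\{[^}]*groupname="([^"]*)"[^}]*\}\s+(\S+)'
)


def num_procs_by_group(metrics_text):
    """Return {groupname: num_procs} parsed from Prometheus text output."""
    counts = {}
    for line in metrics_text.splitlines():
        m = _NUM_PROCS.match(line)
        if m:
            counts[m.group(1)] = float(m.group(2))
    return counts


def groups_reported_down(metrics_text):
    """Return the sorted names of groups whose num_procs is 0."""
    return sorted(g for g, n in num_procs_by_group(metrics_text).items() if n == 0)


# Hypothetical sample; "smartview" and "sshd" are illustrative group names.
SAMPLE = """# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="smartview"} 0
namedprocess_namegroup_num_procs{groupname="sshd"} 3
"""

if __name__ == "__main__":
    print(groups_reported_down(SAMPLE))
```

In practice the text would come from the exporter's metrics endpoint, for example via urllib.request.urlopen("http://localhost:9256/metrics"), assuming the exporter listens on port 9256 (adjust if a different listen address is configured).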

The process exporter CLI (using version 0.7.9): /etc/prometheus/process-exporter -config.path /etc/prometheus/process.yml -children=false -threads=false

The matcher:

This is the problematic command line: /opt/CPshrd-R81.10/jre_64/bin/java -D_smartview=TRUE -Xdump:directory=/var/log/dump/usermode -Xdump:heap:events=gpf+user -Xdump:tool:none -Xdump:tool:events=gpf+abort+traceassert+corruptcache,priority=1,range=1..0,exec=javaCompress.sh smartview %pid -Xdump:tool:events=systhrow,filter=java/lang/OutOfMemoryError,priority=1,range=1..0,exec=javaCompress.sh smartview %pid -Xdump:tool:events=throw,filter=java/lang/OutOfMemoryError,exec=kill -9 %pid -Xaggressive -Xshareclasses:none -Xgc:scvTenureAge=1,noAdaptiveTenure -Xmx512m -Xms512m -Djava.io.tmpdir=/opt/CPrt-R81.10/tmp -Dfile.encoding=UTF-8 -DDedicatedServer=false -DIsMLM=false -DTaskExecThreads=4 -Dlog4j.configuration=file:/opt/CPrt-R81.10/conf/smartview.log4j.properties -DDMExecPoolSize=20 -Dorg.terracotta.quartz.skipUpdateCheck=true -DRTDIR=/opt/CPrt-R81.10 -Dpath=/opt/CPrt-R81.10/jars/aspectjrt-1.8.9.jar:/opt/CPrt-R81.10/jars/commons-io-2.5.jar:/opt/CPrt-R81.10/jars/commons-lang-2.6.jar:/opt/CPrt-R81.10/jars/cxf-core-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-java2ws-plugin-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-bindings-soap-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-bindings-xml-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-databinding-aegis-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-databinding-jaxb-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-frontend-jaxws-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-frontend-simple-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-javascript-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-transports-http-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-transports-http-jetty-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-ws-addr-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-ws-policy-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-rt-wsdl-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-tools-common-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-tools-java2ws-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-tools-validator-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-tools-wsdlto-core-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-tools-wsdlto-databinding-jaxb-3.1.0.jar:/opt/CPrt-R81.10/jars/cxf-tools-wsdlto-frontend-jaxws-3.1.0.jar:/opt/CPrt-R81.10/jars/java_is.jar:/opt/CPrt-R81.10/jars/java_sic.jar:/opt/CPrt-R81.10/jars/jaxb-api-2.2.7.jar:/opt/CPrt-R81.10/jars/jaxb-core-2.2.7.jar:/opt/CPrt-R81.10/jars/jaxb-impl-2.2.7.jar:/opt/CPrt-R81.10/jars/jaxb-xjc-2.2.11.jar:/opt/CPrt-R81.10/jars/neethi-3.0.3.jar:/opt/CPrt-R81.10/jars/rfl_sic.jar:/opt/CPrt-R81.10/jars/smartview-jetty.jar:/opt/CPrt-R81.10/jars/woodstox-core-asl-4.4.1.jar:/opt/CPrt-R81.10/jars/wsdl4j-1.6.3.jar:/opt/CPrt-R81.10/jars/xmlschema-core-2.2.1.jar: -DSTOP.PORT=8079 -DSTOP.KEY=smartview -jar start.jar OPTIONS=Server,resources,websocket /opt/CPrt-R81.10/conf/smartview-jetty.xml /opt/CPrt-R81.10/conf/smartview-service-jetty.xml

dkundo commented 3 years ago

Update: the "problematic" daemon is controlled by a watchdog process. The daemon can go down and then be brought back up by the watchdog after about 2 minutes. That is where process-exporter sometimes fails: after stopping and starting the daemon multiple times, I can see that the exporter sometimes picks up the new process and sometimes doesn't.
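To confirm that a restarted daemon really is running while the exporter reports 0, the count can be cross-checked by scanning /proc directly. This is only a rough sketch: it is not how process-exporter itself matches processes, and the pattern -D_smartview=TRUE is just one distinctive token taken from the command line above. count_matching is a hypothetical helper name.

```python
import os
import re


def count_matching(pattern, proc_root="/proc"):
    """Count processes under proc_root whose command line matches pattern."""
    rx = re.compile(pattern)
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited or is unreadable; skip it
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode("utf-8", "replace").strip()
        if rx.search(cmdline):
            count += 1
    return count


if __name__ == "__main__":
    # Assumed pattern: a token that appears only in the daemon's command line.
    print(count_matching(r"-D_smartview=TRUE"))
```

If this prints 1 or more while namedprocess_namegroup_num_procs for the group is 0, the daemon is up and the exporter has missed the new process.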

dkundo commented 3 years ago

Running the exporter with -recheck=true (recheck process names on each scrape) is supposed to solve it.