redhat-developer / osd-monitor-poc

Enabling monitoring for rhche on dsaas / dsaas-stg and exposing metrics to zabbix #43

Closed ibuziuk closed 5 years ago

ibuziuk commented 5 years ago

The rhche-host service on dsaas / dsaas-stg exposes port 8087 for obtaining metrics in Prometheus format:

rhche-host ClusterIP 172.30.149.180 <none> 8080/TCP,8087/TCP 52d

Currently it is possible to obtain the ClassLoader / JVM / Tomcat metrics from osd monitor via the service name and port combination, e.g. curl rhche-host:8087

These metrics need to be consumed and visualized by osd monitor, and also exposed to zabbix. Currently the most important metrics are the following:

# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 43.0
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 43.0
# HELP jvm_threads_states_threads The current number of threads having NEW state
# TYPE jvm_threads_states_threads gauge
jvm_threads_states_threads{state="new",} 0.0
jvm_threads_states_threads{state="runnable",} 11.0
jvm_threads_states_threads{state="blocked",} 0.0
jvm_threads_states_threads{state="waiting",} 25.0
jvm_threads_states_threads{state="timed-waiting",} 7.0
jvm_threads_states_threads{state="terminated",} 0.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads 38.0

# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{area="nonheap",id="Code Cache",} 2.3396352E7
jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 6.9337088E7
jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",} 8519680.0
jvm_memory_committed_bytes{area="heap",id="PS Eden Space",} 1.5204352E7
jvm_memory_committed_bytes{area="heap",id="PS Survivor Space",} 524288.0
jvm_memory_committed_bytes{area="heap",id="PS Old Gen",} 4.9283072E7
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="nonheap",id="Code Cache",} 2.2897152E7
jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 6.7874864E7
jvm_memory_used_bytes{area="nonheap",id="Compressed Class Space",} 8108328.0
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 1.322024E7
jvm_memory_used_bytes{area="heap",id="PS Survivor Space",} 484352.0
jvm_memory_used_bytes{area="heap",id="PS Old Gen",} 4.0086792E7
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 1.073741824E9
jvm_memory_max_bytes{area="heap",id="PS Eden Space",} 1.77733632E8
jvm_memory_max_bytes{area="heap",id="PS Survivor Space",} 524288.0
jvm_memory_max_bytes{area="heap",id="PS Old Gen",} 3.58088704E8

# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{id="direct",} 17.0
jvm_buffer_count_buffers{id="mapped",} 0.0
# HELP jvm_buffer_memory_used_bytes An estimate of the memory that the Java virtual machine is using for this buffer pool
# TYPE jvm_buffer_memory_used_bytes gauge
jvm_buffer_memory_used_bytes{id="direct",} 515770.0
jvm_buffer_memory_used_bytes{id="mapped",} 0.0
# HELP jvm_buffer_total_capacity_bytes An estimate of the total capacity of the buffers in this pool
# TYPE jvm_buffer_total_capacity_bytes gauge
jvm_buffer_total_capacity_bytes{id="direct",} 515770.0
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0

# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total 5.75668224E8
# HELP jvm_gc_max_data_size_bytes Max size of old generation memory pool
# TYPE jvm_gc_max_data_size_bytes gauge
jvm_gc_max_data_size_bytes 3.58088704E8
# HELP jvm_gc_live_data_size_bytes Size of old generation memory pool after a full GC
# TYPE jvm_gc_live_data_size_bytes gauge
jvm_gc_live_data_size_bytes 3.9021816E7
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of minor GC",cause="Allocation Failure",} 37.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Allocation Failure",} 0.444
jvm_gc_pause_seconds_count{action="end of minor GC",cause="GCLocker Initiated GC",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="GCLocker Initiated GC",} 0.006
jvm_gc_pause_seconds_count{action="end of major GC",cause="Ergonomics",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",cause="Ergonomics",} 0.136
# HELP jvm_gc_pause_seconds_max Time spent in GC pause
# TYPE jvm_gc_pause_seconds_max gauge
jvm_gc_pause_seconds_max{action="end of minor GC",cause="Allocation Failure",} 0.045
jvm_gc_pause_seconds_max{action="end of minor GC",cause="GCLocker Initiated GC",} 0.006
jvm_gc_pause_seconds_max{action="end of major GC",cause="Ergonomics",} 0.136
# HELP jvm_gc_memory_promoted_bytes_total Count of positive increases in the size of the old generation memory pool before GC to after GC
# TYPE jvm_gc_memory_promoted_bytes_total counter
jvm_gc_memory_promoted_bytes_total 1.0042528E7

# HELP process_files_max_files The maximum file descriptor count
# TYPE process_files_max_files gauge
process_files_max_files 1048576.0
# HELP process_files_open_files The open file descriptor count
# TYPE process_files_open_files gauge
process_files_open_files 77.0

# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage 0.002553191489361702
# HELP system_cpu_count The number of processors available to the Java virtual machine
# TYPE system_cpu_count gauge
system_cpu_count 4.0
# HELP system_cpu_usage The "recent cpu usage" for the whole system
# TYPE system_cpu_usage gauge
system_cpu_usage 0.13787234042553193

The number of live threads (jvm_threads_live_threads) is currently by far the most important metric, since it would allow us to investigate the P1 issue: https://github.com/openshiftio/openshift.io/issues/4626
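
To illustrate how the live thread count could be pulled out of the output above, here is a minimal Python sketch of a parser for the Prometheus text format. It is not the osd monitor implementation; the function names parse_samples and live_threads are hypothetical, and the parser only handles plain sample lines like the ones quoted in this issue (it skips lines that carry a trailing timestamp).

```python
import re

# One sample line: metric name, optional {label="value",...} block, numeric value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping # HELP / # TYPE lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything that is not a plain sample line
        name, raw_labels, raw_value = match.groups()
        # The trailing comma inside {...} is harmless: only key="value" pairs match.
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def live_threads(text):
    """Return the value of jvm_threads_live_threads, or None if it is absent."""
    for name, _labels, value in parse_samples(text):
        if name == "jvm_threads_live_threads":
            return value
    return None


# Excerpt of the output quoted in this issue.
EXAMPLE = """\
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 43.0
jvm_threads_states_threads{state="runnable",} 11.0
jvm_memory_max_bytes{area="nonheap",id="Metaspace",} -1.0
"""

print(live_threads(EXAMPLE))
```

The same parse_samples output also carries the label sets, so the per-state thread counts or per-pool memory values could be selected in the same way.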

ibuziuk commented 5 years ago

cc: @fche

fche commented 5 years ago

@ibuziuk just to clarify, is the only deliverable still outstanding in this PR the outgoing relay of some metrics to zabbix?

ibuziuk commented 5 years ago

@fche correct, the only thing that is missing is exposing the metrics to zabbix
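
One possible shape for that missing relay, sketched in Python under several assumptions: that the values would be pushed with zabbix_sender from an input file of whitespace-separated "host key value" lines, and that matching trapper items exist on the Zabbix side. The host name, the "rhche." key prefix, and the key naming scheme below are all hypothetical and do not come from this thread.

```python
import re

# Hypothetical values: the Zabbix host the items would be registered under
# and a key prefix; neither is specified in this issue.
ZABBIX_HOST = "rhche-host"
KEY_PREFIX = "rhche."


def zabbix_key(name, labels):
    """Flatten a metric name and its label values into one item key."""
    parts = [name] + [labels[k] for k in sorted(labels)]
    # Replace anything outside [0-9A-Za-z_-] so the key needs no quoting.
    return KEY_PREFIX + ".".join(re.sub(r"[^0-9A-Za-z_-]", "_", p) for p in parts)


def sender_lines(samples, host=ZABBIX_HOST):
    """Format (name, labels, value) samples as '<host> <key> <value>' lines."""
    return ["%s %s %s" % (host, zabbix_key(name, labels), value)
            for name, labels, value in samples]


# Values taken from the metrics quoted earlier in this issue.
SAMPLES = [
    ("jvm_threads_live_threads", {}, 43.0),
    ("jvm_threads_states_threads", {"state": "timed-waiting"}, 7.0),
    ("jvm_memory_used_bytes", {"area": "heap", "id": "PS Old Gen"}, 4.0086792E7),
]

for line in sender_lines(SAMPLES):
    print(line)
```

Flattening label values into the key name avoids quoting rules for keys that contain spaces, at the cost of one Zabbix item per label combination.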

aditya-konarde commented 5 years ago

Not sure if this issue is relevant anymore since we've deployed a prometheus instance specific to rhche

ibuziuk commented 5 years ago

Closing