sipcapture / heplify-server

HEP Capture Server for HOMER
https://sipcapture.org
GNU Affero General Public License v3.0
183 stars 85 forks source link

heplify-server's DB health monitoring #551

Open rnalrd opened 4 months ago

rnalrd commented 4 months ago

This is somewhat related to https://github.com/sipcapture/heplify-server/issues/338

I see that heplify-server runs even when the DB is not available, it just logs an error. This is true also when the DB is connected upon start up but goes down later on: it logs an error but heplify-server keeps on running. I haven't seen it trying to reconnect, at least not in 15-30 minutes since the connection went down.

I assume that keep on running without the DB connection is by design. My problem is: how do I monitor that heplify-server is healthy (that it is inserting data into the DB)? I could parse the log of course..

lmangani commented 4 months ago

@rnalrd heplify-server supports multiple operational methods, with or without an active database depending on the case. If you use the DB and don't see the reconnection, please collect the failure logs and share them here. You can also scrape the heplify-server metrics endpoint to monitor the service itself.

rnalrd commented 4 months ago

Here there is a log example:

heplify2:~# /usr/bin/heplify-server -config /etc/heplify-server/heplify-server.toml -logstd true -loglvl dbg -logdbg sql,help,loki
2024/05/07 08:22:16.465893 server.go:93: INFO start heplify-server 1.59.7 with config.HeplifyServer{HEPAddr:"0.0.0.0:9060", HEPTCPAddr:"", HEPTLSAddr:"0.0.0.0:9061", HEPWSAddr:"0.0.0.0:3000", ESAddr:"", ESDiscovery:true, HEPv2Enable:true, ESUser:"", ESPass:"", LokiURL:"", LokiBulk:200, LokiTimer:4, LokiBuffer:100000, LokiHEPFilter:[]int{1, 5, 100}, LokiIPPortLabels:false, LokiAllowOutOfOrder:false, ForceHEPPayload:[]int{}, PromAddr:"", PromTargetIP:"", PromTargetName:"", DBShema:"homer7", DBDriver:"postgres", DBAddr:"homer-db-a-lb:9999", DBSSLMode:"require", DBUser:"homer_user", DBPass:"<private>", DBDataTable:"homer_data", DBConfTable:"homer_config", DBBulk:200, DBTimer:4, DBBuffer:400000, DBWorker:8, DBRotate:true, DBPartLog:"2h", DBPartIsup:"6h", DBPartSip:"1h", DBPartQos:"6h", DBDropDays:7, DBDropDaysCall:0, DBDropDaysRegister:0, DBDropDaysDefault:0, DBDropOnStart:false, DBUsageProtection:true, DBUsageScheme:"percentage", DBPercentageUsage:"80%", DBMaxSize:"1024MB", DBProcDropLimit:2, Dedup:true, DiscardMethod:[]string{"REGISTER", "KDMQ", "SUBSCRIBE", "OPTIONS"}, CensorMethod:[]string(nil), AlegIDs:[]string{}, ForceALegID:false, CustomHeader:[]string{}, IgnoreCaseCH:false, SIPHeader:[]string{}, LogDbg:"sql", LogLvl:"debug", LogStd:true, LogSys:false, Config:"/etc/heplify-server/heplify-server.toml", ConfigHTTPAddr:"", ConfigHTTPPW:"", Version:false, ScriptEnable:false, ScriptEngine:"lua", ScriptFolder:"", ScriptHEPFilter:[]int{1, 5, 100}, TLSCertFolder:"/etc/heplify-server/cert", TLSMinVersion:"1.2"}
2024/05/07 08:22:16.466241 rotator.go:551: INFO start creating tables (2024-05-07 08:22:16.466137978 +0000 UTC m=+0.011684764)
2024/05/07 08:22:16.468179 rotator.go:556: ERR dial tcp [dead:beef::1016::110]:9999: connect: connection refused
2024/05/07 08:22:16.468546 rotator.go:559: ERR dial tcp [dead:beef::1016::110]:9999: connect: connection refused
2024/05/07 08:22:16.468884 rotator.go:562: ERR dial tcp [dead:beef::1016::110]:9999: connect: connection refused
2024/05/07 08:22:16.468933 rotator.go:564: INFO end creating tables (2024-05-07 08:22:16.468920233 +0000 UTC m=+0.014467016)
dial tcp [dead:beef::1016::110]:9999: connect: connection refused
2024/05/07 08:22:16.469674 rotator.go:508: ERR dial tcp [dead:beef::1016::110]:9999: connect: connection refused
2024/05/07 08:22:16.470024 server.go:152: ERR dial tcp [dead:beef::1016::110]:9999: connect: connection refused
2024/05/07 08:27:16.467087 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 08:32:16.466241 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 08:37:16.466739 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 08:42:16.467147 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 08:47:16.466461 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 08:52:16.467091 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 08:57:16.467194 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:02:16.466500 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:07:16.466773 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:12:16.466596 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:17:16.466644 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:22:16.466243 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:27:16.466541 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:32:16.466230 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:37:16.466581 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:42:16.466627 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:47:16.466206 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:52:16.466953 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 09:57:16.466786 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 10:02:16.467111 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 10:07:16.466561 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 10:12:16.466591 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 10:17:16.467175 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 10:22:16.466675 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 10:27:16.467079 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2024/05/07 10:32:16.466190 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0

Where is the heplify-server metrics endpoint?

lmangani commented 3 months ago

The prometheus /metrics endpoint is exposed on the port configured in the PromAddr parameter

PromAddr = "0.0.0.0:8899"

rnalrd commented 3 months ago

Thank you.

I'm not seeing anything here that shows the status of the DB connection, am I missing something?

heplify:~# curl http://localhost:8999/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000108652
go_gc_duration_seconds{quantile="0.25"} 0.000108652
go_gc_duration_seconds{quantile="0.5"} 0.000159544
go_gc_duration_seconds{quantile="0.75"} 0.000159544
go_gc_duration_seconds{quantile="1"} 0.000159544
go_gc_duration_seconds_sum 0.000268196
go_gc_duration_seconds_count 2
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 91
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.22.2"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 7.67784e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 1.1804152e+07
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.451349e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 46889
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 4.139774966066865e-05
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 2.703144e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 7.67784e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 5.750784e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 9.781248e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 6835
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 3.039232e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 1.5532032e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.7158324088897762e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 53724
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 38400
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 46800
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 157280
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 179520
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 1.3298704e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 2.522235e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.245184e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.245184e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 2.3680264e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 18
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.94
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 4096
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 13
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.7734016e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71583238465e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.282646016e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes -1
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0