perforce / p4prometheus

[Community Supported] Perforce (Helix Core) interface for writing Prometheus metrics from real-time analysis of p4d log files.
MIT License
48 stars 24 forks source link

p4prometheus v0.7.5 may have bug for arguments #21

Closed jakeRM closed 1 year ago

jakeRM commented 3 years ago

I am not sure if potentially the required arguments have changed, or if something else is occuring, but after updating one of the perforce instances to the latest p4prometheus release, I am seeing this error frequently:

[root@euodyl-prf001 ~]# systemctl status p4prometheus -l
● p4prometheus.service - P4prometheus
   Loaded: loaded (/etc/systemd/system/p4prometheus.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2021-11-04 12:07:01 PDT; 5 days ago
  Process: 127327 ExecStart=/usr/local/bin/p4prometheus.linux-amd64 --config=/p4/common/config/p4prometheus.yml (code=exited, status=2)
 Main PID: 127327 (code=exited, status=2)

Nov 02 09:22:01 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-02T09:22:01-07:00" level=info msg="P4Prometheus config: &{Debug:0 ServerID:standby2 SDPInstance:1 UpdateInterval:15s OutputCmdsByUser:true OutputCmdsByUserRegex: OutputCmdsByIP:true CaseSensitiveServer:true}"
Nov 02 09:22:01 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-02T09:22:01-07:00" level=info msg="watching new file" directory=/p4/1/logs fd=3 file=log
Nov 02 15:03:10 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-02T15:03:10-07:00" level=info msg="Resetting running to 8 as encountered server threads message"
Nov 03 05:00:02 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-03T05:00:02-07:00" level=info msg="watching new file" fd=7 file=log
Nov 03 05:00:02 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-03T05:00:02-07:00" level=info msg="file was removed, closing and un-watching" fd=3 file=log
Nov 04 05:00:01 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-04T05:00:01-07:00" level=info msg="watching new file" fd=3 file=log
Nov 04 05:00:01 euodyl-prf001 p4prometheus.linux-amd64[127327]: time="2021-11-04T05:00:01-07:00" level=info msg="file was removed, closing and un-watching" fd=7 file=log
Nov 04 12:07:01 euodyl-prf001 systemd[1]: p4prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 04 12:07:01 euodyl-prf001 systemd[1]: Unit p4prometheus.service entered failed state.
Nov 04 12:07:01 euodyl-prf001 systemd[1]: p4prometheus.service failed.

This is my config file

[root@euodyl-prf001 ~]# cat /p4/common/config/p4prometheus.yml 
# Configuration file for p4prometheus
# Normally stored in /p4/common/config

log_path: /p4/1/logs/log
metrics_output: /srv/prometheus/node_exporter/p4_cmds.prom
server_id:
sdp_instance: 1
output_cmds_by_user: true

Any idea what may be causing this?

rcowham commented 3 years ago

Try running it on CLI:

/usr/local/bin/p4prometheus.linux-amd64 --config=/p4/common/config/p4prometheus.yml

Can also try falg: --debug

See what it outputs. Check which user account it is being run as.

rcowham commented 2 years ago

@jakeRM any news?

jakeRM commented 2 years ago

Sorry I haven't had a chance to run in debug, doing that today. Here is my service file definition for reference

[root@euodyl-prf001 ~]# cat /etc/systemd/system/p4prometheus.service 
[Unit]
  Description=P4prometheus
  Wants=network-online.target
  After=network-online.target

[Service]
  User=perforce
  Group=perforce
  Type=simple
  ExecStart=/usr/local/bin/p4prometheus.linux-amd64 --config=/p4/common/config/p4prometheus.yml

[Install]
  WantedBy=multi-user.target

I will report back after I have run it with debug

jakeRM commented 2 years ago

@rcowham - The service crashed again and debug was enabled, but it did not capture any useful information. is there any other flags that should be passed to see why it may be crashing?

[root@euodyl-prf001 ~]# journalctl -l -u p4prometheus.service -n 30
-- Logs begin at Fri 2022-01-07 20:10:04 PST, end at Mon 2022-01-10 11:22:02 PST. --
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-09-20_05-00-02
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-10-12_05-00-01
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-10-22_05-00-01
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=journal.6554.bak
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-09-21_05-00-01
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=log.2021-12-13_05-00-02.gz
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-10-25_05-00-01
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-12-29_05-00-01
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-09-22_05-00-02
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=log.2021-12-24_05-00-01.gz
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-10-27_05-00-01
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=log.2022-01-02_05-00-02.gz
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-09-23_05-00-02
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-10-10_05-00-01
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-10-26_05-00-01
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-09-24_05-00-02
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=log.2021-12-17_05-00-01.gz
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2022-01-02_05-00-02
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-09-26_05-00-01
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-10-11_05-00-01
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-10-28_05-00-01
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=p4verify.log
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-09-27_05-00-02
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=log.2022-01-04_05-00-02.gz
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=journal.6832.bak
Jan 08 08:53:55 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:53:55-08:00" level=debug msg="skipping file, because file name does not match" file=replica_cleanup.log.2021-10-18_05-00-01
Jan 08 08:54:02 euodyl-prf001 systemd[1]: p4prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 08 08:54:02 euodyl-prf001 systemd[1]: Unit p4prometheus.service entered failed state.
Jan 08 08:54:02 euodyl-prf001 systemd[1]: p4prometheus.service failed.
rcowham commented 2 years ago

Did you try running it on the command line?

rcowham commented 2 years ago

Preferably as same user as service, e.g. "perforce"

jakeRM commented 2 years ago

yes I did, it was giving me the same output but did not crash while it was running on the command line. It takes a few days before it crashes even when running as a service

rcowham commented 2 years ago

The error message looks like it is coming from systemd or the service script, not the executable:

https://freedesktop.org/software/systemd/man/systemd.exec.html#id-1.21.6

Which refers to:

https://refspecs.linuxbase.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html

There may be some form of syslog message for around the same time perhaps??

Also try:

systemd-analyze blame | grep -i p4prometheus

rcowham commented 2 years ago

Is it exceeding memory constraints? Any OOM killer entries in syslog around that time?

jakeRM commented 2 years ago

Aww ok I missed this earlier:

Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:54:02-08:00" level=debug msg="read line \"Perforce server info:\""
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:54:02-08:00" level=debug msg="read line \"\\t2022/01/08 08:54:02 pid 87519 p4admi
n@euodyl-prf001 10.136.132.38 [p4/2019.2/LINUX26X86_64/1999325] 'user-pull -lj'\""
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:54:02-08:00" level=debug msg="outputCompletedCommands: start 21000, end 21000, co
unt 0, continued 22522, exited 6"
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:54:02-08:00" level=debug msg="outputCompletedCommands: start 21001, end 21001, co
unt 0, continued 22523, exited 6"
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64[101502]: time="2022-01-08T08:54:02-08:00" level=debug msg="outputCompletedCommands: start 21002, end 21002, co
unt 0, continued 22524, exited 6"
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64[101502]: panic: ERROR: max running command limit (20000) exceeded. Does this server log have completion record
s configured (configurable server=3)?
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64[101502]: goroutine 14 [running]:
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64[101502]: github.com/rcowham/go-libp4dlog.(*P4dFileParser).LogParser.func4()
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64: goroutine 14 [running]:
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64: github.com/rcowham/go-libp4dlog.(*P4dFileParser).LogParser.func4()
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64: /Users/rcowham/go/pkg/mod/github.com/rcowham/go-libp4dlog@v0.9.7/p4dlog.go:1421 +0x21b
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64: created by github.com/rcowham/go-libp4dlog.(*P4dFileParser).LogParser
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64: /Users/rcowham/go/pkg/mod/github.com/rcowham/go-libp4dlog@v0.9.7/p4dlog.go:1407 +0x2f2
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64[101502]: /Users/rcowham/go/pkg/mod/github.com/rcowham/go-libp4dlog@v0.9.7/p4dlog.go:1421 +0x21b
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64[101502]: created by github.com/rcowham/go-libp4dlog.(*P4dFileParser).LogParser
Jan  8 08:54:02 euodyl-prf001 p4prometheus.linux-amd64[101502]: /Users/rcowham/go/pkg/mod/github.com/rcowham/go-libp4dlog@v0.9.7/p4dlog.go:1407 +0x2f2
Jan  8 08:54:02 euodyl-prf001 systemd[1]: p4prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Jan  8 08:54:02 euodyl-prf001 systemd[1]: Unit p4prometheus.service entered failed state.
Jan  8 08:54:02 euodyl-prf001 systemd: p4prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Jan  8 08:54:02 euodyl-prf001 systemd: Unit p4prometheus.service entered failed state.
Jan  8 08:54:02 euodyl-prf001 systemd[1]: p4prometheus.service failed.
Jan  8 08:54:02 euodyl-prf001 systemd: p4prometheus.service failed.

Is there a way to increase the running command limit?

rcowham commented 2 years ago

Currently hard coded (https://github.com/rcowham/go-libp4dlog/blob/master/p4dlog.go#L41) - so rebuild. However, it is usually a sign of failing to match up start/end records so builds up over time. This is typically one or other of:

This can also be tested by using log2sql from https://github.com/rcowham/go-libp4dlog/releases Run that on one of your log files and check for metrics, and then check the same running count in the metrics. If always increasing then a sign of a problem. If I can get a repro exceprt of the log I can fix.

jakeRM commented 2 years ago

I think its the first issue of "server=1" for this specific server. I will update accordingly and also try the log test you mentioned.

rcowham commented 1 year ago

Assuming fixed with server=3

jakeRM commented 1 year ago

that was one fix, the other was to unset track=1, which seems to add additional log messages that cannot be matched up on replicas.