perforce / p4prometheus

[Community Supported] Perforce (Helix Core) interface for writing Prometheus metrics from real-time analysis of p4d log files.
MIT License

The cronjobs (monitor_metrics*) get stuck and keep spawning when replica is taking a checkpoint #23

Open hueyvle opened 2 years ago

hueyvle commented 2 years ago

I have had to remove the p4prometheus installation and its cron jobs from the replica because of this.

I have quite a large environment, where taking a checkpoint takes 12+ hours. During that time, every Perforce command gets stuck, including the cron jobs.

Is there a workaround for this?

rcowham commented 2 years ago

Which version of the server are you running? If it is p4d 2021.1 or later, then we can turn on Realtime Monitoring for the server: https://www.perforce.com/manuals/cmdref/Content/CmdRef/p4_monitor.html

The script could read the value of 'rtv.db.ckp.active' to detect a checkpoint in progress. If that's an option for you, I am happy to look at putting detection for this situation in place.

Please note that we usually recommend doing offline checkpoints, as done by SDP. This avoids locking the live database, even on a replica. https://community.perforce.com/s/article/2419

SDP: https://swarm.workshop.perforce.com/projects/perforce-software-sdp
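As a rough illustration of the detection idea, here is a minimal Python sketch that reads 'rtv.db.ckp.active' from the output of `p4 monitor realtime`. It is not the actual monitor_metrics implementation. The function names and the `p4_cmd` and `timeout` parameters are hypothetical, and the parser assumes each counter is printed on its own line as `name value`, possibly followed by extra fields; check this against the real output on your server. It only applies to p4d 2021.1 or later with Realtime Monitoring enabled.

```python
import subprocess


def parse_realtime_counter(output, name):
    """Return the integer value of counter `name`, or None if it is absent.

    Assumption: each line looks like "<name> <value> [extra fields...]".
    """
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == name:
            try:
                return int(fields[1])
            except ValueError:
                return None
    return None


def checkpoint_active(p4_cmd=("p4",), timeout=10):
    """Return True/False for a checkpoint in progress, or None if unknown.

    The timeout keeps this call from hanging if the server does not answer.
    """
    try:
        result = subprocess.run(
            list(p4_cmd) + ["monitor", "realtime"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    value = parse_realtime_counter(result.stdout, "rtv.db.ckp.active")
    return None if value is None else value != 0
```

A metrics script could call `checkpoint_active()` first and skip the rest of its p4 commands when the result is `True` or `None`.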

hueyvle commented 2 years ago

We are running p4d 2015.2, and the db.* files are huge (over 100G). We have p4prometheus installed on the master, where it runs without any issues. However, after installing it on the replica, we found that everything freezes while a checkpoint is running (somewhat expected).

If the monitor_metrics* scripts could detect the locking and stop spawning new jobs, that would work.
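One generic way to keep cron from piling up jobs, independent of the p4d version, is for each run to take a non-blocking exclusive lock on a lock file and exit immediately if a previous run still holds it. The Python sketch below shows that pattern. It is not how monitor_metrics* currently works, and the function name and lock path are hypothetical. It relies on `fcntl.flock`, so it is Unix-only.

```python
import fcntl
import os


def try_acquire(lock_path):
    """Try to take an exclusive, non-blocking lock on `lock_path`.

    Returns an open file descriptor on success (keep it open for the whole
    run; the lock is released when it is closed or the process exits), or
    None if another run already holds the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


if __name__ == "__main__":
    # Hypothetical lock file location; choose one writable by the cron user.
    lock_fd = try_acquire("/tmp/monitor_metrics.lock")
    if lock_fd is None:
        raise SystemExit("previous run still active, skipping this run")
    # ... collect metrics here while holding the lock ...
```

With this guard, a run that is stuck behind a 12-hour checkpoint still blocks, but every later cron invocation exits at once, so at most one stuck process exists at a time.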

rcowham commented 2 years ago

Do you have lslocks installed?
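For context, lslocks marks a process that is blocked waiting for a lock by appending `*` to its MODE value, so its output can show whether p4d processes are queued behind locks on db.* files. Below is a minimal Python sketch of that check. It is an illustration only, not the project's implementation. It assumes a util-linux version whose lslocks supports `--json`, and that the JSON has a top-level `locks` list with lowercase keys `command`, `pid`, `mode` and `path`. The function name is hypothetical.

```python
import json


def blocked_db_waiters(lslocks_json, command="p4d"):
    """Return (pid, path) pairs for `command` processes waiting on db.* locks.

    `lslocks_json` is the text produced by something like:
        lslocks --json --output COMMAND,PID,MODE,PATH
    A trailing '*' on the mode means the process is blocked on that lock.
    """
    data = json.loads(lslocks_json)
    waiters = []
    for lock in data.get("locks", []):
        mode = lock.get("mode") or ""
        path = lock.get("path") or ""  # path can be missing or null
        file_name = path.rsplit("/", 1)[-1]
        if (
            lock.get("command") == command
            and mode.endswith("*")
            and file_name.startswith("db.")
        ):
            # pid may be a number or a string depending on the lslocks version
            waiters.append((int(lock["pid"]), path))
    return waiters
```

If the returned list is non-empty, a monitoring script could skip its p4 commands for that cycle instead of joining the queue of blocked processes.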

hueyvle commented 2 years ago

Yes, I do.