hipages / php-fpm_exporter

A Prometheus exporter for PHP-FPM.

feature: add an option to control whether to collect phpfpm process state metrics #85

Closed: stanxing closed this issue 4 years ago

stanxing commented 4 years ago

The exporter exposes detailed metrics for every PHP-FPM process. But each container can produce a great many processes per day, so even if this information is collected, it cannot be displayed in any useful detail in a dashboard. I don't think these per-process metrics make sense; maybe there should be an option to disable scraping the process state. For example, this is my config for a PHP-FPM pool:

pm = dynamic
pm.start_servers = 12
pm.max_children = 100
pm.process_idle_timeout = 10s
pm.min_spare_servers = 5
pm.max_spare_servers = 20

The number of processes changes automatically based on load. If this behavior is kept, too many PIDs will be produced and collected.

estahn commented 4 years ago

If I recall correctly, the Grafana dashboard we provide as part of this project does use these metrics. However, I'm accepting a PR for this. As an alternative, you can always use Prometheus to drop metrics:
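A minimal sketch of such a scrape config, assuming the per-process series share the phpfpm_process_ name prefix (adjust the regex and target address to your setup):

    scrape_configs:
      - job_name: phpfpm
        static_configs:
          - targets: ['localhost:9253']   # hypothetical exporter address
        metric_relabel_configs:
          # Drop every per-process series before ingestion.
          - source_labels: [__name__]
            regex: 'phpfpm_process_.*'
            action: drop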

In regards to your PHP-FPM configuration, I'd suggest running some tests with static instead of dynamic. This reduces the dynamic nature of PHP-FPM within a Pod and allows the HPA to do its job better (at least in our experience): Kubernetes scales the pods and PHP-FPM itself scales nothing; otherwise the resource consumption of your pod becomes increasingly variable.

This is the configuration from one of our services:

          - PHP_FPM_PM: static
          - PHP_FPM_PM_MAX_CHILDREN: '"10"'
          - PHP_FPM_PM_MAX_REQUESTS: '"5000"'

Edit: ^^ This is with the assumption you're running Kubernetes, which I'm realising might not be the case.

stanxing commented 4 years ago

This is with the assumption you're running Kubernetes, which I'm realising might not be the case.

As you assumed, our services are running on k8s. The reason for choosing dynamic is that I don't think the HPA can respond quickly enough when a burst of traffic arrives. We have also enabled the HPA with CPU plus a custom metric named phpfpm_active_processes.
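For reference, a sketch of such an HPA, assuming a metrics adapter (e.g. prometheus-adapter) exposes phpfpm_active_processes through the custom metrics API; the deployment name and target values here are hypothetical:

    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
      name: php-app                      # hypothetical deployment name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: php-app
      minReplicas: 2
      maxReplicas: 20
      metrics:
        # Scale on CPU utilisation ...
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
        # ... and on the per-pod custom metric scraped from the exporter.
        - type: Pods
          pods:
            metric:
              name: phpfpm_active_processes
            target:
              type: AverageValue
              averageValue: "10"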

matejzero commented 4 years ago

I have problems with VictoriaMetrics because of pid_hash, which sometimes disappears and then reappears later. Because of that, VM's staleness algorithm wrongly interprets the values (VM doesn't support the Prometheus staleness marker).

My graphs for php-fpm in VM (static with max workers set to 400): [screenshot from 2020-04-03]

A quote from VictoriaMetrics developer regarding the problem:

Note that the pid_hash label may increase time series cardinality and churn rate for php-fpm workload with high churn rate of worker processes.

I think it would be better to have a phpfpm_process_state_count gauge with distinct state values instead of a phpfpm_process_state metric for each worker process, since this solves the high-cardinality and high-churn-rate issues while keeping good enough observability. Additionally, it would be great to add a phpfpm_process_state_duration_seconds histogram for each state, in order to track the lifecycle durations of worker processes.
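As an illustration only (a minimal sketch with client_golang, not the exporter's actual code; the pool/state label set and the sample values are assumptions):

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // One series per (pool, state) pair instead of one per worker PID,
    // which removes pid_hash and caps cardinality at the number of states.
    var processStateCount = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "phpfpm_process_state_count",
            Help: "Number of PHP-FPM worker processes in each state.",
        },
        []string{"pool", "state"},
    )

    func main() {
        prometheus.MustRegister(processStateCount)

        // Hypothetical scrape result: 8 idle and 4 running workers in pool "www".
        processStateCount.WithLabelValues("www", "Idle").Set(8)
        processStateCount.WithLabelValues("www", "Running").Set(4)

        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":9253", nil))
    }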

What do you think?

estahn commented 4 years ago

@matejzero I think phpfpm_process_state_count sounds like a good compromise.

matejzero commented 4 years ago

It should also be a lot quicker than the current metrics: in my case it takes a few seconds to render the number of processes by state.

This should also help lower the churn rate (since pid_hash would be removed), which is the label with the highest cardinality in my setup: 4662 new values per day across 4 servers.

phpfpm_process_state is also one of the highest-cardinality metrics, at 12k series per day.
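For context, the per-state aggregation above currently has to be computed at query time across every per-process series, along the lines of this sketch (assuming phpfpm_process_state carries pool and state labels):

    # Count worker processes per pool and state from the per-process gauge.
    count by (pool, state) (phpfpm_process_state)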

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.