jcmoraisjr / haproxy-ingress

HAProxy Ingress
https://haproxy-ingress.github.io
Apache License 2.0

Metrics collected with Prometheus seem to restart when haproxy reloads #897

Open gusnakada opened 2 years ago

gusnakada commented 2 years ago

Description of the problem

I'm facing an issue where some haproxy metrics are not collected for a while after a full reload.

Whenever a full reload is executed the metrics seem to drop. Note that not all metrics are affected; metrics like haproxy_backend_http_responses_total or haproxy_frontend_current_sessions are affected.

In our environment we haven't set any reload strategy, so I believe we're using the reusesocket parameter by default.

So I was wondering if this is expected behavior after a full reload, when the listening sockets are copied from the old instance to the new one and a new process is created for Prometheus to scrape.
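
For reference, a minimal sketch of how the reload strategy could be made explicit on the controller Deployment, assuming the --reload-strategy command-line option (reusesocket is the documented default on recent versions; native is the alternative):

 Args:
      --reload-strategy=reusesocket   # default: listening sockets are handed from the old process to the new one
      # --reload-strategy=native      # alternative: sockets are closed and re-bound on every reload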

Steps to reproduce the problem

  1. When haproxy reloads, the metrics collected with Prometheus seem to restart, causing a gap in the graph timeline. (image)

  2. Haproxy metrics like haproxy_backend_http_responses_total or haproxy_frontend_current_sessions are affected (see the query sketch after the screenshot below).

image
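
To illustrate where the gap comes from, here is a hedged sketch of a Prometheus rules file consuming these metrics; the label names (proxy, code) follow the HAProxy native exporter and may differ in other setups. rate() tolerates the counter reset caused by a reload, while the gauge genuinely dips until connections land on the new worker:

 groups:
   - name: haproxy-ingress-example
     rules:
       - record: backend:haproxy_http_responses:rate5m
         # counter: the post-reload reset is handled by rate(), so no negative spike
         expr: sum(rate(haproxy_backend_http_responses_total[5m])) by (proxy, code)
       - record: frontend:haproxy_current_sessions:sum
         # gauge: this value really drops after a reload because old workers are not scraped
         expr: sum(haproxy_frontend_current_sessions) by (proxy)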

Environment information

HAProxy Ingress version: v0.13.4

log
I0128 22:24:25.534787       6 controller.go:326] finish haproxy update id=3799399: parse_ingress=0.169054ms write_maps=0.022186ms total=0.191240ms
I0128 22:24:25.531389       6 instance.go:310] old and new configurations match
I0128 22:24:25.526583       6 controller.go:317] starting haproxy update id=3799399
I0128 22:24:25.526452       6 instance.go:335] haproxy successfully reloaded (external)
I0128 22:24:25.526801       6 instance.go:372] updating 0 backend(s): []
I0128 22:24:25.526734       6 ingress.go:326] syncing 0 host(s) and 0 backend(s)
I0128 22:24:25.526516       6 controller.go:338] finish haproxy reload id=6273: reload_haproxy=8197.521913ms total=8197.521913ms
I0128 22:24:25.526793       6 instance.go:355] updating 0 host(s): []
I0128 22:24:25.526698       6 converters.go:65] applying 1 change notification: [update/endpoint:xxx]
[ALERT] 027/222425 (22867) : backend 'xxx' has no server available!
[WARNING] 027/222425 (22867) : Serverxx is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 027/222425 (22867) : backend 'xxx' has no server available!
[ALERT] 027/222425 (22867) : backend 'xx' has no server available!

 Args:
      --v=2
      --allow-cross-namespace
      --acme-server
      --acme-track-tls-annotation
      --backend-shards=96
      --buckets-response-time=0.005,0.01,0.02
      --configmap=ingress-controller/ingress-controller-blue
      --default-ssl-certificate=ingress-controller/ingress-default-tls
      --default-backend-service=ingress-controller/ingress-default-backend
      --disable-pod-list
      --rate-limit-update=0.5
      --reload-interval=1m
      --update-status=false
      --watch-ingress-without-class
      --master-socket=/var/run/haproxy/master.sock
jcmoraisjr commented 2 years ago

Hi, this is an expected behavior due to how the source of data is generated and where they are collected.

When a full reload happens - i.e. when haproxy cannot apply the configuration changes dynamically, such as changing whitelists or changing some host/path routing - a whole new process is forked with the new configuration. Old processes continue to live until their last connection is gone or the stop timeout is triggered. The problem starts here: the /metrics endpoint reads data only from the new instance, which might have just a few connections because it is so young, while all the other connections and sources of data are in the old instances. Because of that you see a drop in the total number of sessions and in the number of connections on backends that you know have many more active connections - all the remaining data is hidden in the old, currently inaccessible instances.

Note that you will probably also see, from time to time, an abrupt increase in the number of connections on backends known to hold a large number of long-lived connections - this happens when the stop timeout is triggered on the old instances, closing their client connections, and clients that need a persistent connection reconnect almost at the same time.
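
If the lingering old workers are an operational concern, the time they are allowed to keep running can be bounded. A minimal sketch, assuming the timeout-stop configuration key exposed by the controller (the value shown is illustrative):

 apiVersion: v1
 kind: ConfigMap
 metadata:
   name: ingress-controller-blue   # the configmap referenced by --configmap in the args above
   namespace: ingress-controller
 data:
   timeout-stop: "10m"             # upper bound for how long old workers (and the data they hold) survive a reload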

There is already a backlog item to work around this annoyance in the metrics, planned for v0.15 (Q2'22), and the first step towards it already landed in v0.14 (allowing master/worker mode on the embedded haproxy).

gusnakada commented 2 years ago

Hey, it's good to talk with you again.

I'll discuss this with my team; it's what I thought was happening after reading the docs about reload strategy.

Thank you for your detailed explanation and for your support as always. See ya!

jcmoraisjr commented 2 years ago

Hey there, nice to see you here as well. Regarding the reload strategy, I think you don't need to bother with it at all, since it only affects how listening sockets are transferred from one process to the just-forked one. It works like that to implement a seamless reload; you can read more about it in this nice article from HAProxy Tech. So it doesn't seem to make sense to change it. Other than that, the metrics look broken because the old, still active but unreachable processes hold part of the data. Ping me at k8s slack!