jcmoraisjr / haproxy-ingress

HAProxy Ingress
https://haproxy-ingress.github.io
Apache License 2.0

Premature termination of haproxy process #1028

Closed tumd closed 3 months ago

tumd commented 11 months ago

Description of the problem

We are running haproxy-ingress in GKE, fronted by their default/legacy target pool-based LB. However, I think this is likely to be an issue in other environments too.

For quite some time we have noticed short outages on our services behind haproxy-ingress when it is being redeployed.

A note about the GCP LB: it sets up health checks with an interval of 8 seconds and marks the backend as unhealthy after 3 failed checks. I'm not that versed in k8s, but these health checks probe a public port on each node that reports the (termination) status of the locally running haproxy-ingress.

In this case this means we would need haproxy-ingress to keep running for at least 8*3 = 24 seconds before it terminates, to allow the GCP LB to register the terminating instance as unhealthy.

The haproxy-ingress command-line option --wait-before-shutdown set to 30 (or higher) would supposedly help with this issue. You would also need to set spec.template.spec.terminationGracePeriodSeconds to something like 60 to give the pod more time to terminate gracefully.
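For reference, a minimal deployment sketch combining the two settings; the container name, image tag and namespace below are illustrative rather than copied from this environment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: haproxy-ingress            # illustrative name
  namespace: ingress-controller
spec:
  template:
    spec:
      # give the pod more time than --wait-before-shutdown needs
      terminationGracePeriodSeconds: 60
      containers:
      - name: haproxy-ingress
        image: quay.io/jcmoraisjr/haproxy-ingress:v0.14.5   # illustrative tag
        args:
        - --master-worker
        - --wait-before-shutdown=30
```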

But herein lies the problem: even with --wait-before-shutdown=30, the haproxy process inside the pod exits immediately upon pod termination, even though the haproxy-ingress-controller process remains for 30s.

Below is a log from a haproxy-ingress pod that just received a termination signal.

Log:

```
...
10.40.0.23:39086 [13/Oct/2023:12:49:29.249] _front_https~ app-dev_file-service_8080/srv001 0/0/0/2/2 200 237 - - ---- 6/4/0/0/0 0/0 {} {xxx} "GET https://app.xy/monitor/status HTTP/2.0"
[WARNING] (21) : Exiting Master process...
[NOTICE] (21) : haproxy version is 2.4.24-d175670
[ALERT] (21) : Current worker #1 (23) exited with code 143 (Terminated)
[WARNING] (21) : All workers exited. Exiting... (0)
I1013 12:49:29.351729 7 main.go:47] Shutting down with signal terminated
I1013 12:49:29.351798 7 controller.go:269] Waiting 1m30s before stopping components
I1013 12:49:29.444615 7 controller.go:339] starting haproxy update id=37
I1013 12:49:29.444793 7 converters.go:67] applying 4 change notifications: [update/Endpoints:ingress-controller/haproxy-ingress-metrics update/Endpoints:ingress-controller/haproxy-ingresss update/Pod:ingress-controller/haproxy-ingress-55847447bf-7svbh update/Pod:ingress-controller/haproxy-ingress-55847447bf-7svbh]
I1013 12:49:29.444826 7 ingress.go:232] syncing 0 host(s) and 0 backend(s)
I1013 12:49:29.444871 7 instance.go:441] updating 0 host(s): []
I1013 12:49:29.444892 7 instance.go:458] updating 0 backend(s): []
I1013 12:49:29.445053 7 instance.go:358] old and new configurations match
I1013 12:49:29.445137 7 controller.go:348] finish haproxy update id=37: parse_ingress=0.199073ms write_maps=0.018148ms total=0.217221ms
I1013 12:49:29.450632 7 instance.go:655] haproxy stopped
E1013 12:49:29.573257 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
E1013 12:49:30.073784 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
E1013 12:49:30.577157 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
E1013 12:49:31.080171 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
I1013 12:49:31.445598 7 controller.go:339] starting haproxy update id=38
I1013 12:49:31.446402 7 converters.go:67] applying 2 change notifications: [update/Endpoints:ingress-controller/haproxy-ingresss update/Endpoints:ingress-controller/haproxy-ingress-metrics]
I1013 12:49:31.446435 7 ingress.go:232] syncing 0 host(s) and 0 backend(s)
I1013 12:49:31.446478 7 instance.go:441] updating 0 host(s): []
I1013 12:49:31.446488 7 instance.go:458] updating 0 backend(s): []
I1013 12:49:31.446526 7 instance.go:358] old and new configurations match
I1013 12:49:31.446554 7 controller.go:348] finish haproxy update id=38: parse_ingress=0.163445ms write_maps=0.018770ms total=0.182215ms
E1013 12:49:31.581484 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
E1013 12:49:32.082324 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
E1013 12:49:32.587343 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
E1013 12:49:33.088468 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
E1013 12:49:33.642636 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:130]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:147]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:164]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:180]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:215]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:232]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:288]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:312]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:329]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:346]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:383]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:400]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:457]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:475]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:507]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:524]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:561]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:578]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:634]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:651]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:683]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:699]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:734]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:751]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:805]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:821]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:837]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:852]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:885]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
[WARNING] (220) : parsing [/etc/haproxy/haproxy.cfg:901]: 'option httpchk' : hiding headers or body at the end of the version string is deprecated. Please, consider to use 'http-check send' directive instead.
E1013 12:49:34.143576 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
E1013 12:49:34.644169 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
E1013 12:49:35.144684 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
E1013 12:49:35.650237 7 instance.go:241] error reading admin socket: error connecting to /var/run/haproxy/admin.sock: dial unix /var/run/haproxy/admin.sock: connect: connection refused
[NOTICE] (220) : New worker #1 (222) forked
10.40.0.23:58602 [13/Oct/2023:12:49:35.871] _front_https~ app-dev_file-service_8080/srv001 0/0/1/5/6 200 237 - - ---- 1/1/0/0/0 0/0 {} {xxx} "GET https://app.xy/monitor/status HTTP/2.0"
...
```

As you can see, the haproxy process terminates immediately, and the parent haproxy-ingress-controller starts it up again a few seconds later after it fails to reach its admin socket.

Expected behavior

I expect the haproxy process not to react to the SIGTERM signal and instead be stopped by haproxy-ingress-controller after the specified timeout.

Steps to reproduce the problem

  1. Set up haproxy-ingress with the args --master-worker --wait-before-shutdown=30
  2. Look at the process list: kubectl --context k3s -n ingress-controller exec <pod> -- ps -o pid,pgid,args
  3. Terminate the pod and watch the process list output:
    kubectl --context k3s -n ingress-controller delete pod <pod> --wait=false ; \
    while true; do
     date --iso-8601=seconds && \
     kubectl --context k3s -n ingress-controller exec <pod> -- ps -o pid,pgid,args && \
     echo && sleep 1;
    done;
Example output:

```
❯ kubectl --context k3s -n ingress-controller exec haproxy-ingress-854d468d74-f7qmw -- ps -o pid,pgid,args && kubectl --context k3s -n ingress-controller delete pod haproxy-ingress-854d468d74-f7qmw --wait=false ; while true; do date --iso-8601=seconds && kubectl --context k3s -n ingress-controller exec haproxy-ingress-854d468d74-f7qmw -- ps -o pid,pgid,args && echo && sleep 1; done;
PID   PGID  COMMAND
    1     1 /usr/bin/dumb-init -- /start.sh --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
    7     7 /haproxy-ingress-controller --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
   20     7 haproxy -W -S /var/run/haproxy/master.sock,mode,600 -f /etc/haproxy
   22     7 haproxy -W -S /var/run/haproxy/master.sock,mode,600 -f /etc/haproxy
   33    33 ps -o pid,pgid,args
pod "haproxy-ingress-854d468d74-f7qmw" deleted
2023-10-16T15:53:55+02:00
PID   PGID  COMMAND
    1     1 /usr/bin/dumb-init -- /start.sh --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
    7     7 /haproxy-ingress-controller --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
   39    39 ps -o pid,pgid,args

2023-10-16T15:53:56+02:00
PID   PGID  COMMAND
    1     1 /usr/bin/dumb-init -- /start.sh --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
    7     7 /haproxy-ingress-controller --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
   46    46 ps -o pid,pgid,args

2023-10-16T15:53:57+02:00
PID   PGID  COMMAND
    1     1 /usr/bin/dumb-init -- /start.sh --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
    7     7 /haproxy-ingress-controller --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
   52    52 ps -o pid,pgid,args

2023-10-16T15:53:58+02:00
PID   PGID  COMMAND
    1     1 /usr/bin/dumb-init -- /start.sh --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
    7     7 /haproxy-ingress-controller --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
   58    58 ps -o pid,pgid,args

2023-10-16T15:53:59+02:00
PID   PGID  COMMAND
    1     1 /usr/bin/dumb-init -- /start.sh --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
    7     7 /haproxy-ingress-controller --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=30 --update-status-on-shutdown
   64     7 haproxy -W -S /var/run/haproxy/master.sock,mode,600 -f /etc/haproxy
   66     7 haproxy -W -S /var/run/haproxy/master.sock,mode,600 -f /etc/haproxy
   70    70 ps -o pid,pgid,args
```

One important part of the above output is the PGID column, which shows the process group of each process.

The SIGTERM signal is received by dumb-init, which in turn passes the signal down to its children. Because the haproxy-ingress-controller process and the haproxy processes live in the same process group, they both react to the SIGTERM. I think what needs to be done is to decouple the haproxy process from the process group of haproxy-ingress-controller.

I think this answer may be helpful. https://stackoverflow.com/a/35435038/4496802

Environment information

HAProxy Ingress version: v0.14.5

Command-line options:

--default-backend-service=$(POD_NAMESPACE)/default-backend
--default-ssl-certificate=$(POD_NAMESPACE)/haproxy-default-cert
--configmap=$(POD_NAMESPACE)/haproxy-ingress
--sort-backends
--acme-server
--acme-track-tls-annotation
--master-worker
--wait-before-shutdown=30

Global options:

ssl-redirect: "true"
prometheus-port: "9101"
ssl-dh-default-max-size: "2048"
max-connections: "2000"
timeout-server: "500s"
timeout-server-fin: "500s"
config-frontend: |
  capture response header X-Request-ID len 36
  declare capture request len 1
  capture cookie xxx len 32
syslog-endpoint: stdout
syslog-length: "20480"
syslog-format: raw
acme-emails: ncc@example.com
acme-endpoint: v2
acme-terms-agreed: "true"

Ingress objects:

apiVersion: v1
kind: Service
metadata:
  name: haproxy-ingresss
  namespace: ingress-controller
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  # loadBalancerIP: foo.bar.baz.qux
  ports:
  - port: 80
    targetPort: 80
    name: http
  - port: 443
    targetPort: 443
    name: https
  selector:
    app: haproxy-ingress
avmnu-sng commented 11 months ago

Since we are already using dumb-init, we can rewrite the signal TERM (15) to USR1 (10), so HAProxy performs a graceful shutdown. Concretely, change:

https://github.com/jcmoraisjr/haproxy-ingress/blob/cf9d05a07e313e17173ec7439ca0e40fa9f04fcf/builder/Dockerfile#L46 to

ENTRYPOINT ["/usr/bin/dumb-init", "--rewrite", "15:10", "--", "/start.sh"]
tumd commented 11 months ago

The problem I'm describing is not solved by translating the signal sent from dumb-init.

The haproxy process itself should not receive whatever signal dumb-init is sending at all. haproxy should only receive signals directly from haproxy-ingress-controller. haproxy-ingress-controller should basically implement what dumb-init is doing, i.e. spawn its children (haproxy) in a separate process group and manage their lifecycle. This would make it possible to translate whatever signal should be sent to haproxy, and --wait-before-shutdown would work as expected too.

I would argue that dumb-init should not be needed at all, as its essential functionality should be handled by haproxy-ingress-controller instead.

jcmoraisjr commented 10 months ago

Thanks for such a detailed description. I can reproduce the problem when --master-worker is enabled; it seems that haproxy handles kernel signals differently when running in master-worker mode, and because of that we didn't see such a problem before. I can easily fix that by changing how dumb-init is configured:

ENTRYPOINT ["/usr/bin/dumb-init", "--single-child", "--", "/start.sh"]

This seems to fix at least half of the problems you're facing. Do you happen to have the ability to deploy the controller that way, in order to see how your env behaves? This can be done by configuring command: in the container declaration (the controller's helm chart doesn't have an option for that) or by creating a new image that overwrites it.
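For illustration, a pod spec override along these lines should apply it without rebuilding the image; the container name is an assumption, and any controller flags the chart passes via args: are still appended by Kubernetes after /start.sh:

```yaml
spec:
  template:
    spec:
      containers:
      - name: haproxy-ingress   # assumed container name
        # command: replaces the image ENTRYPOINT; args: keeps working as before
        command: ["/usr/bin/dumb-init", "--single-child", "--", "/start.sh"]
```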

The second half of the problem is regarding the way your lb is retrieving the controller's endpoints. If it uses the LoadBalancer typed service you provided above, we should be good, because I can confirm that the service's endpoint resource is updated as soon as the pod starts to be deleted, regardless of the wait-shutdown config. You can double-check it with kubectl get ep -n ingress-controller -w while rolling updating or deleting pods.

If you have other means to provide the controller's endpoints to your fronting lb, please let me know; this info can be used to improve HA configs and docs for haproxy ingress.

Regarding dumb-init: the Linux kernel manages pid 1 differently from other processes, and dumb-init knows how to deal with that. We can get rid of it in the future, but we need to revisit a few scenarios before the change. In the meantime --single-child seems to do what needs to be done.

I'm going to create a new PR that changes the entrypoint in that way, along with a few more HA related docs. Thanks again for bringing this issue to our attention.

avmnu-sng commented 10 months ago

Yeah, --single-child should solve the problem. I'll try to test it out as well. However, we should update the HA process termination to send USR1 and not KILL.

jcmoraisjr commented 10 months ago

We do not have the time to send USR1 and wait. When haproxy receives the signal, all the other phases have already passed and we simply need to turn the lights off, close the door and leave fast. Even if the current TERM were changed to the proposed USR1, the host's kernel would KILL/9 it, because the pid 1 of the container, dumb-init, is finishing because its child process, haproxy-ingress, has just finished as well.

What we need to do instead is to prevent new requests from coming in, by notifying any fronting infra that we cannot receive new requests anymore. This is what this issue, and the comment I sent earlier today, are all about.

avmnu-sng commented 10 months ago

So, when we kill HA, does the ingress controller ensure all the ongoing connections have finished? If that's the case, I've missed this logic when going through the code.

jcmoraisjr commented 10 months ago

It doesn't work that way. You need to configure a time frame based on how long you expect your requests to last, and the controller will stop the proxy after that time frame.

I see your point: we could instead configure a timeout and the controller would stop as soon as the last request has finished. We don't have that logic in place. The only approach today is to ask the controller to wait a fixed amount of time, some sort of "don't check haproxy, just wait until the timeout expires and quit after that".

avmnu-sng commented 10 months ago

I see the gap now. I used "graceful shutdown" to mean "all the HA requests were completed before quitting HA." We can use ha-shutdown-timeout as this timeout, because when the last request finishes depends on how the various timeouts are set in haproxy.cfg, so only the user can tell how long to wait. When dumb-init receives the TERM:

Or, we can wait indefinitely for HA to exit gracefully after issuing USR1. I guess the controller takes care of not taking any more requests after receiving TERM.

jcmoraisjr commented 10 months ago

Waiting indefinitely doesn't look like an option if you consider the way Kubernetes works. You need to provide a timeout, which defaults to 30s and which the haproxy ingress Helm chart changes to 60s. Note that this might stall your rolling update process depending on the deployment configuration. On top of that, I'd also say that waiting for more than 1m doesn't seem like a good pattern. Your http requests should be fast.

If you're worried about websockets, haproxy gracefully closes all the running connections, so clients have a chance to connect again. This happens without any issue for clients.

Note also that this only applies to rolling updates, which shouldn't happen all the time. The day-to-day reloads of haproxy to apply a new configuration already use USR1, so an old process is removed from memory as soon as all its requests finish.

jcmoraisjr commented 10 months ago

Folks, I did a few more tests with preStop:

helm chart:

controller:
  ...
  lifecycle:
    preStop:
      exec:
        command: ["sleep", "20"]
  ...
  haproxy: # in the case external haproxy is needed
    ...
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "20"]

This delays the SIGTERM signal to the containers by 20s, after the pod IP has been removed from the endpoint resource. No need to configure wait-shutdown. This means that, if the load balancer follows the haproxy ingress service, the endpoint will be removed as soon as the pod starts to be deleted, and the pod waits 20s before it receives the SIGTERM signal and actually terminates.

avmnu-sng commented 10 months ago

We must also consider the server and backend queues and the kernel socket queues. We use backend queues and did a quick test with sleep 60, but I still see many errors. Can you elaborate on "If the load balancer follows the haproxy ingress service, the endpoint will be removed as soon as the pod starts to be deleted"?

I guess we need --single-child with preStop as well.

nodefactory-bk commented 10 months ago

Guys, it seems you are reaching for lots of workarounds instead of starting by addressing the core issue as pointed out by @tumd.

It's pretty standard Unix behaviour to send a signal to a process group, not just to a single process. Also note that the process group leader has the same pid as the group id, so sending a signal to that process is the same as sending it to the group. That is exactly what dumb-init does (https://github.com/Yelp/dumb-init):

> In its default mode, dumb-init establishes a session rooted at the child, and sends signals to the entire process group. This is useful if you have a poorly-behaving child (such as a shell script) which won't normally signal its children before dying.

This means that if haproxy-ingress-controller starts haproxy and doesn't explicitly make sure it ends up in its own process group, both will receive the SIGTERM from dumb-init.

For details see Stevens, Advanced Programming in the UNIX Environment, the chapters about process groups, signals and daemon processes (chapters 9, 10 and 13 in the 2013 edition). BTW, that book should be required reading for anyone doing anything nontrivial on unixlike systems.

The proper thing to do is for haproxy-ingress-controller to ensure that haproxy runs in its own process group (and session), see the earlier linked stackoverflow post: https://stackoverflow.com/questions/35433741/in-golang-prevent-child-processes-to-receive-signals-from-calling-process/35435038#35435038

And then, when the controller receives SIGTERM from dumb-init, it starts whatever it needs to do to prepare for shutdown, waits for the configured timeout and sends its own SIGTERM to haproxy when it is good and ready.

No translation of signals should be needed.

Thus in short the sequence would be:

  1. K8s sends SIGTERM
  2. dumb-init sends it on to haproxy-ingress-controller
  3. controller does whatever controller does while haproxy still proxies.
  4. controller sends SIGTERM (or whatever it wishes (USR1??)) to haproxy when everything is ready to shut down.
  5. All terminate.
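For illustration only, here is a minimal Go sketch of steps 3 and 4 of that sequence. This is not the controller's actual code; the haproxy command line and the 30s wait are taken from the discussion above, everything else is made up:

```go
package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Step 3: haproxy keeps proxying in its own process group, so a SIGTERM
	// delivered to the controller's group (what dumb-init does by default)
	// never reaches it.
	cmd := exec.Command("haproxy", "-W", "-S", "/var/run/haproxy/master.sock,mode,600", "-f", "/etc/haproxy")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} // child becomes leader of a new process group
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Wait for the SIGTERM that Kubernetes/dumb-init delivers to the controller.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Keep haproxy serving while the fronting LB marks this instance as
	// unhealthy (the role of --wait-before-shutdown).
	time.Sleep(30 * time.Second)

	// Step 4: the controller decides when and how haproxy stops
	// (SIGTERM here, or SIGUSR1 for haproxy's graceful stop).
	_ = cmd.Process.Signal(syscall.SIGTERM)
	_ = cmd.Wait()
}
```

Because the child is started with Setpgid, the group-wide SIGTERM never reaches haproxy; only the controller's explicit signal stops it.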
avmnu-sng commented 10 months ago

@nodefactory-bk As @jcmoraisjr mentioned, using --single-child addresses the premature termination. With this it should follow the sequence you are talking about.

We are now discussing the next part: how to allow HA to finish up all the pending jobs in a timely manner.

tumd commented 10 months ago

> Thanks for such a detailed description. I can reproduce the problem when --master-worker is enabled; it seems that haproxy handles kernel signals differently when running in master-worker mode, and because of that we didn't see such a problem before. I can easily fix that by changing how dumb-init is configured:
>
> ENTRYPOINT ["/usr/bin/dumb-init", "--single-child", "--", "/start.sh"]
>
> This seems to fix at least half of the problems you're facing. Do you happen to have the ability to deploy the controller that way, in order to see how your env behaves? This can be done by configuring command: in the container declaration (the controller's helm chart doesn't have an option for that) or by creating a new image that overwrites it.
>
> Regarding dumb-init: the Linux kernel manages pid 1 differently from other processes, and dumb-init knows how to deal with that. We can get rid of it in the future, but we need to revisit a few scenarios before the change. In the meantime --single-child seems to do what needs to be done.

Adding --single-child as an argument to dumb-init via command: does seem to work as a workaround. This puts dumb-init, haproxy-ingress-controller and all haproxy processes in the same process group (1).

ps output:

$ kubectl -n ingress-controller exec haproxy-ingress-567df674bd-4pwxb -- ps -o pid,pgid,args
PID   PGID  COMMAND
    1     1 /usr/bin/dumb-init --single-child -- /start.sh --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=50
    7     1 /haproxy-ingress-controller --default-backend-service=ingress-controller/default-backend --default-ssl-certificate=ingress-controller/haproxy-default-cert --configmap=ingress-controller/haproxy-ingress --sort-backends --acme-server --acme-track-tls-annotation --master-worker --wait-before-shutdown=50
   21     1 haproxy -W -S /var/run/haproxy/master.sock,mode,600 -f /etc/haproxy
   23     1 haproxy -W -S /var/run/haproxy/master.sock,mode,600 -f /etc/haproxy
   29    29 ps -o pid,pgid,args

It's a little bit strange that the SIGTERM sent by k8s isn't received by the whole process group, which was the case without the --single-child arg. But I guess that's one of the peculiarities of PID and process group 1.

As mentioned above, this is simply a workaround, and the proper fix would be to handle this correctly in the controller process, as described by @nodefactory-bk.

> The second half of the problem is regarding the way your lb is retrieving the controller's endpoints. If it uses the LoadBalancer typed service you provided above, we should be good, because I can confirm that the service's endpoint resource is updated as soon as the pod starts to be deleted, regardless of the wait-shutdown config. You can double-check it with kubectl get ep -n ingress-controller -w while rolling updating or deleting pods.
>
> If you have other means to provide the controller's endpoints to your fronting lb, please let me know; this info can be used to improve HA configs and docs for haproxy ingress.

Yes, this seems to be the case for us when using a LoadBalancer typed service. The fronting GCP LB probes each node on a specific port where some k8s service is listening. The status of this endpoint seems to follow the status of the node-local haproxy pod. My guess is that this endpoint is controlled by the LoadBalancer typed service.

And as mentioned earlier: it's important that haproxy lingers untouched for the --wait-before-shutdown amount of time, to allow the fronting GCP LB to register the terminating pod as "UNHEALTHY".

jcmoraisjr commented 10 months ago

Hi, a few doubts before I can move all the HA related updates forward:

tumd commented 10 months ago

Yes, the fronting LB (GCP LB) does use the LoadBalancer typed haproxy service as the source for its health checks, seemingly by probing all nodes on the port defined in healthCheckNodePort on the service.

An example from one of our clusters in GKE:

❯ kubectl --context testing -n ingress-controller get services haproxy-ingresss -o custom-columns="healthCheckNodePort:spec.healthCheckNodePort,type:spec.type"
healthCheckNodePort   type
31916                 LoadBalancer
❯ kubectl --context testing get nodes -o custom-columns="INTERNAL-IP:.status.addresses[0].address"
INTERNAL-IP
10.132.0.50
10.132.0.60
10.132.0.80
10.132.0.74
10.132.0.61
10.132.0.62
❯ curl -v 10.132.0.50:31916
* processing: 10.132.0.50:31916
*   Trying 10.132.0.50:31916...
* Connected to 10.132.0.50 (10.132.0.50) port 31916
> GET / HTTP/1.1
> Host: 10.132.0.50:31916
> User-Agent: curl/8.2.1
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Thu, 02 Nov 2023 09:54:29 GMT
< Content-Length: 138
<
{
        "service": {
                "namespace": "ingress-controller",
                "name": "haproxy-ingresss"
        },
        "localEndpoints": 0,
        "serviceProxyHealthy": true
* Connection #0 to host 10.132.0.50 left intact
}
❯ curl -v 10.132.0.60:31916
* processing: 10.132.0.60:31916
*   Trying 10.132.0.60:31916...
* Connected to 10.132.0.60 (10.132.0.60) port 31916
> GET / HTTP/1.1
> Host: 10.132.0.60:31916
> User-Agent: curl/8.2.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Thu, 02 Nov 2023 09:54:33 GMT
< Content-Length: 138
<
{
        "service": {
                "namespace": "ingress-controller",
                "name": "haproxy-ingresss"
        },
        "localEndpoints": 1,
        "serviceProxyHealthy": true
* Connection #0 to host 10.132.0.60 left intact
}

> Why do you folks consider changing the group id better than asking dumb-init to ignore it at all?

Because it's the proper way to handle parent/child relationships where the parent is responsible for killing its children. It will then also work when started by something other than dumb-init in combination with its --single-child arg.

jcmoraisjr commented 3 months ago

Changed the pid group of the haproxy process when started in master-worker mode. This should be available in the next v0.14 and v0.15 releases. Thanks folks for all the discussion on this subject. Closing.