traefik / traefik

The Cloud Native Application Proxy
https://traefik.io
MIT License
50.62k stars, 5.05k forks

Add TCP Health Check using SYN, SYN-ACK, and RST packets #10794

Status: Open. eos175 opened this issue 3 months ago.

eos175 commented 3 months ago

Welcome!

What did you expect to see?

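I have implemented a new TCP health check using the tcp-shaker library (https://github.com/tevino/tcp-shaker). It uses SYN, SYN-ACK, and RST packets to make health checks of TCP services more accurate: it sends a SYN, waits for the SYN-ACK, and then answers with an RST instead of completing the handshake, so the backend application never sees an established connection.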

You can review the implementation on my branch here.

Would love to get feedback and discuss how this can be integrated into the project.

Thanks!

nmengin commented 3 months ago

Hey @eos175.

Thanks for your suggestion.

We are interested in this issue, but we’re unsure about the use case and the traction it will receive. We are going to leave the status as kind/proposal to give the community time to check your implementation and let us know if they would like this idea.

We will reevaluate as people respond.

Conversation is time-boxed to 6 months.

eos175 commented 3 months ago

Some related issues are #5598 and #1657.

To use it, the procedure is as follows:

services:
  example_tcp_service:
    loadBalancer:
      servers:
        - address: 10.42.64.4:7445
        - address: 10.42.0.1:7445
        - address: 127.0.0.1:8448
      healthCheck:
        path: /health
        interval: "10s"
        timeout: "3s"
        mode: tcp

Note that path is ignored in this mode, since a plain TCP check has no notion of an HTTP path.

Additionally, it would be worth considering a "mode: passive", which would enable passive checks based on observing real traffic to the back-end servers. However, that is less straightforward to implement.

leonidas-o commented 1 month ago

@nmengin, what do you mean by "we’re unsure about the use case"? It's a TCP health check, which is really needed in a product like this. Just look at #5598: people are asking for it there. Another real use case is installing OpenShift with traefik as the load balancer:

https://docs.okd.io/4.16/installing/installing_bare_metal/installing-bare-metal-network-customizations.html#installation-load-balancing-user-infra-example_installing-bare-metal-network-customizations

The example uses HAProxy. As you can see, it runs in mode tcp and checks /readyz, so there is a need for a TCP health check:

listen api-server-6443 
  bind *:6443
  mode tcp
  option  httpchk GET /readyz HTTP/1.0
  ...

I can't speak to @eos175's implementation, but I (and, I'm pretty sure, many others) would really appreciate it if you accepted it, provided the implementation quality meets your standards.

nmengin commented 1 month ago

Hello @leonidas-o,

Thank you for your feedback.

@nmengin what do you mean by "we’re unsure about the use case"?

I meant that the other maintainers and I are unsure whether many people in the community want this feature. We prefer to collect feedback before accepting it as an enhancement in Traefik.

jhvst commented 1 month ago

This proposal is just what I am looking for!

It covers these two cases:

The first case is common in environments where you have your own certificates that you must present to the upstream. This is typical of non-cloud environments that host public services but use a proxy to mask the origin IP.

I am looking to migrate to traefik to replace Caddy and nginx, both of which I currently have to use for the following reasons:

  1. Caddy does not support TCP proxying
  2. nginx requires a Plus subscription for monitoring TCP endpoint metrics via Prometheus

The proposal in this thread would provide a feature that makes traefik a one-stop solution for all of my use cases.

dfrt82 commented 3 weeks ago

I second that; I definitely need this feature. I'm currently trying traefik with a very simple web application that uses WebSockets. Working HTTP load balancing is of no use here as long as TCP load balancing has no way to automatically skip dead servers too; otherwise, traefik tries to establish WebSocket connections to the dead servers.

Also, I don't understand why the use case is being questioned again. As @leonidas-o pointed out, #5598 should be enough evidence, given that you already asked for community support.

eos175 commented 3 weeks ago

If anyone wants to test it, the master branch includes changes up to version 3.1.0. So far everything is working well, and the 30s timeout delay is gone.
