jcmoraisjr / haproxy-ingress

HAProxy Ingress

https://haproxy-ingress.github.io

Apache License 2.0

1.04k stars 269 forks source link

Move from configuration templating to HAProxy Data Plane APIs #541

Closed prometherion closed 4 years ago

prometherion commented 4 years ago

First of all, thanks for the great effort for this Ingress Controller: we're using it in production for some of our platforms! :)

Our scenario is quite extreme since, although sharding, we're using a single Ingress class to serve more than 13k ingresses: we noticed a slow down from a performances point of view due to the huge amount of resources the Ingress Controller has to watch and to render in the configuration file, causing some headaches to customers as well.

I0325 20:30:02.267245       8 controller.go:362] Finish HAProxy update id=4926: ingress=228868.095906ms tcpServices=1.882284ms writeTmpl=115314.450049ms reload=124324.179853ms total=468508.608092ms

Our idea is to take advantage of the HAProxy Data Plane APIs, with the following pro and cons, please, feel free to add more items to the following lists.

Pro

Getting rid of the template pain of the configuration file, since this task is going to be delegated to the data plane APIs
let's take advantage of the transactionality of the Data Plane APIs, allowing us to perform bulk updates with ease.
easier to test: APIs are going to be a source of truth, just need to assert the underlying client has been called with expected arguments.

Cons

not yet benchmarked, I'm not sure this could decrease the template rendering.
Data Plane APIs don't allow to add arbitrary configuration file snippets as the annotation ingress.kubernetes.io/config-backend allows.

From a Kubernetes point of view, the Pod would consist of multiple containers:

haproxy simply and just running the HAProxy binary, sharing the socket on an emptyDir to let it share with the following container
dataplane-api running the Data Plane API binary serving on loopback (127.0.0.1), connecting to the HAProxy instance via the socket shared on the said emptyDir
ingress-controller running the Ingress Controller codebase that is getting events from the Shared Informer to update HAProxy configuration through the Data Plane API.

We could also use a single container but you know, one container for a single process: let's talk!

As a member of Namecheap Cloud Team organization, I would be happy to discuss and implement this feature, I'm available also to provide a simple PoC as scaffolding.

jcmoraisjr commented 4 years ago

Hi, thanks for evaluating and using haproxy ingress and thanks also for sharing numbers of such a big deployment!

Regarding the config parsing: I'm really impressed with your processing time. We have a somewhat big environment as well: about 4500 ingress, 3000 services, 1500 namespaces but all the processing takes up to 5s -- five seconds! Would you mind to share a few more numbers - or some of them? How many namespaces and services do you have? Cloud or bare metal? Cloud or self-made control plane (apiserver et al)? Mean size of the nodes? How much of them?

Regarding Data plane API: This'd be a good step forward reusing code from HAProxyTech. You have strong arguments and I agree with all of them. However, even if unsupported features like custom configs could be added, my main concern here is the dependency with an external component in the core of the controller. I had some issues in the past which made me rewrite most of the controller and stop new features for about 1 year. That said I can say a config option which points to the data plane API is a very welcome feature, provided that the use is optional and the integration starts decoupled from the core. Perhaps we can say in the future that this is the right step and we are, currently, a bit late and should be made such step a while ago.

Regarding optimizations: v0.8 was the rewrite to a cleaner code base and a much more optimized configuration; v0.9 was the h2/grpc and acme features; v0.10 was the metrics; v0.11 will be a new step forward regarding optimizations. There are still some pending performance improvements in the config parser (controller being fast), in the dynamic update analyzer (controller being smart), and also in the config builder (haproxy being fast) that need to be made. You should see better numbers in this new version. Note also that haproxy has some limitations regarding what you can do via api/socket without running a full reload: the config from k8s still need to be parsed, config files still need to be written to disk, haproxy still need to be reloaded. Data plane API cannot skip this if the diff cannot be applied in a running process.

prometherion commented 4 years ago

Would you mind to share a few more numbers - or some of them?

Sure, actually we got:

~18k endpoints
~18k services
~13k ingresses (as said before)

All of these resources are living in a single Namespace (it's a sharded cluster just serving apps for the customers).

This cluster is made of 300 worker nodes, powered by 6 vCPU and 32GB of RAM. Most of them got ~100 running pods. All nodes are provisioned by OpenStack, any cloud provider involved.

We're running v0.9-beta.1 with the following args:

        - --configmap=ingress/haproxy-ingress-1-controller
        - --ingress-class=REDACTED
        - --tcp-services-configmap=ingress/haproxy-ingress-1-controller-tcp
        - --default-backend-service=ingress/haproxy-ingress-1-default-backend
        - --default-ssl-certificate=REDACTED
        - --rate-limit-update=0.1
        - --verify-hostname=false

and the following ConfigMap data:

  backend-server-slots-increment: "42"
  config-global: |
    cpu-map auto:1/1-4 4-7
    tune.pattern.cache-size 0
  health-check-interval: ""
  healthz-port: "10253"
  max-connections: "20000"
  nbthread: "4"
  ssl-engine: rdrand
  ssl-options: no-tlsv10 no-tlsv11 no-tls-tickets
  stats-port: "1936"
  syslog-endpoint: localhost:514
  timeout-client-fin: 5s
  timeout-server-fin: 5s
  use-htx: "true"

No resource limits or requests at all.

my main concern here is the dependency with an external component in the core of the controller

You're right, this should be added to the Cons list items. Although its validity, it means to rely on an open-source project created and maintained directly from HAProxy itself: we should evaluate the pay-off and getter more visibility about the maturity of the APIs.

That said I can say a config option which points to the data plane API is a very welcome feature, provided that the use is optional and the integration starts decoupled from the core

My main concerts is that could be tricky and require a lot of effort: my statements could be wrong since I'm looking at the source code from a while, but it means we have to declare a sort of Interface to let update the configuration and it means decoupling the actual code, moving it to a new struct implementing that.

Pretty sure, the first step should go through this direction: defining Interface to easier the strategy selection (TemplateConfiguration or DataPlaneConfiguration).

Regarding optimizations (...)

The main issue we're facing is the constant generation of the template although the following scenario.

A customer creates a new app (so, an Ingress) with a domain customer-temporary-domain.namecheap.tld
The customer decides to switch over a custom domain (my.fancy.domain.tld)
We end-up with those monster times

The same applies when a Pod (so, an underlying endpoint is going to be updated) is going to be deleted or created.

According to our benchmarks, everything was fine until 5k Ingress resources: after that, reload time started increasing rapidly.

If I can provide you further details, don't hesitate to ping me directly: glad to help and that's my actual duty here in Namecheap :)

jcmoraisjr commented 4 years ago

Although its validity, it means to rely on an open-source project created and maintained directly from HAProxy itself

This is exactly what I did in the beginning of the project: I relied on the so called "Kubernetes Ingress", a component used to build ingress controllers. Long story short: what was a component is now ingress-nginx, the reusable part is tight coupled with nginx, closed (golang internal package) and I needed to wrote my own - currently missing only listers and certs. I'm aware data plane api is another story: its main purpose is to be a rest API used to drive haproxy and it's exactly this feature haproxy ingress would use. Note however that most of the functionalities need to touch haproxy in a way or the other, and I'm a bit afraid to find unsupported and hard-to-implement features in the data plane api which would stuck the development. Because of that I'd like to see one step at a time in that direction, evolving both strategies until its maturity.

My main concerts is that could be tricky and require a lot of effort

Yep, no doubt about this. You said:

define Interface to easier the strategy selection (TemplateConfiguration or DataPlaneConfiguration)

and I'd also say:

Stress haproxy via data plane api using a config as big as yours. This will give an idea if the gains compensate the effort
Evolve the parser to update a dynamic in-memory model instead of create a new config and compare. This should let dynamic updates and API usage easier;

Regarding next steps: I'll start v0.11 shortly. In fact I'll promote v0.9 to stable and v0.10 to beta within a few days and will start to think about performance improvement shortly, which is the main focus of v0.11. Perhaps you can have a look at the data plane api performance? I'll design and run a proof-of-concept of the dynamic in-memory model and split config in more than one file, these features should save some io and cpu cycles.

If I can provide you further details, don't hesitate to ping me directly

Thank you, I'll probably do it. Thanks also for such a nice use case.

prometherion commented 4 years ago

I'd like to see one step at a time in that direction, evolving both strategies until its maturity

LGTM.

Stress haproxy via data plane api using a config as big as yours (...) Perhaps you can have a look at the data plane api performance?

I guess that it's my first task for the upcoming week :) We're lucky because I can benchmark the Ingress Controler directly with production traffic.

jcmoraisjr commented 4 years ago

Hi, I started the draft of a new functionality which is simple to implement and should help to distribute the controller load. It was already planned to v0.11 and I'm planning to start now due to its simplicity. This doesn't invalidate the data plane api evaluation, dynamic model and other possible future optimizations. How it sounds on your deployment? Or, almost the same question in other words: why don't you use ingress-class?

draft

define two controller types: masters and workers (names might be changed)
declare type via command-line option
type is optional, defaults to the current behavior
about masters
- have the public IP or receive the requests from the external load balancer
- L4 load balance of https requests - triage via SNI extension
- L7 load balance of http requests - triage via Host header
about workers
- can and should be splitted into groups
- each worker group is labeled with k8s pod label (masters can easily find) and command-line option (worker know who it is) -- improve this to a single config
- each worker group has a list of hostnames they will watch, parse and use to configure their haproxy instance
- define the hostnames of a worker group via regex list of hostnames and/or namespace names
- need a default worker group for SNI or Host header that doesn't have a match

pros

allows to horizontally scale haproxy ingress with lightweight L4 proxies
allows to isolate domains with very high load or that uses ssl-passthrough (which creates latency on the whole proxy)
an ingress-class like approach without the need to use more IPs or manage DNS (bare metal/on premise), or add new external load balancers

cons

a more complex topology - one daemonset to masters and one per worker group
wisely distribute controller load and requests load can be tricky on some deployments
worker groups, excluding default, cannot receive requests from clients that don't send the SNI extension

prometherion commented 4 years ago

tl;dr; it sounds like more to a sort of Service Mesh rather than an Ingress Controller and we all know that Service Mesh is a hard topic.

From an Ingress Controller (and guess it's the end-user target), I'm expecting something really easy to set up: just deploy your Deployment/DS resource, getting a LoadBalancer or NodePort to get exposed on the Internet and you're OK. A distributed architecture like this could increase the amount of point of failure although its validity, as increasing the network overhead since we're going to get master to worker (YMMV) to service rather than master (aka Ingress Controller) to service. By the way, what you described is something similar to what we were thinking of at the beginning here at Namecheap, even if it was way more extreme like this and decided to go for the Ingress resources sharding (tl;dr; we got several Ingress Classes and splitting load across them): Kubernetes is also providing really nice Dynamic Admission Control resources like MutatingAdmissionWebhooks that can delegate that kind of business logic up to end-users, so patching the kubernetes.io/ingress.class annotation on the fly just counting the number of ingresses an Ingress Controller is handling, the less loaded, and so on and so forth.

Probably, we have to follow the KISS principle and try to enhance further the performances of configuration over templating, as well with Data Plane APIs if feasible and doable.

jcmoraisjr commented 4 years ago

No, the proposal isn't a sort of service mesh because it continue in the same place of the cluster instead of changing input and output routes of every single pod. Besides that I agree on most of your arguments and, although this should improve and save a few proxies on our own deployment, this is also far from a nice design. Back to the drawing board.

Out of curiosity - you said you were thinking about spllt load across several ingress-class, therefore several controller instances. How'd you expose them?

prometherion commented 4 years ago

Out of curiosity - you said you were thinking about spllt load across several ingress-class, therefore several controller instances. How'd you expose them?

We got two or more Ingress Controllers along with class (e.g.: foo, bar) and when creating an app, we select the less loaded and putting its value in the annotation.

It means app living on foo will be caught up by DNS records *.ingress-foo.domain.tld and so on and so forth with other ingress classes. If the customer wishes to switch to a custom domain just need to point a CNAME/ALIAS to that value and you're ready to go! :)

github-actions[bot] commented 4 years ago

This issue got stale and will be closed in 7 days.