haproxy / haproxy

HAProxy Load Balancer's development branch (mirror of git.haproxy.org)
https://git.haproxy.org/

Option to silence DNS lookups when DNS updates/a backend server IP changes #1663

Open dekimsey opened 2 years ago

dekimsey commented 2 years ago

Your Feature Request

Allow an option to disable the "X changed its IP from X to Y" warnings.

haproxy-local | [WARNING] 082/114612 (8) : account_service/account_service changed its IP from 123.456.789.123 to 123.456.789.123 by k8s_dns/dns.
haproxy-local | account_service/account_service_1 changed its IP from 123.456.789.123 to 123.456.789.123 by k8s_dns/dns.
haproxy-local | account_service/account_service_1 changed its IP from 123.456.789.123 to 123.456.789.123 by k8s_dns/dns.

I'm not sure if this should be a server, backend, or DNS configuration option but it seems like it's being emitted from multiple locations.

What are you trying to do?

We have a few backends that are defined as DNS entries pointing to external systems (S3 in this example). By design, these systems rotate their DNS A records and shuffle their responses. Currently, when such a hostname is set as a backend server, HAProxy logs an endless stream of "X changed its IP from X to Y".

Refer to discussion on this subject here: https://discourse.haproxy.org/t/stop-logging-x-changed-its-ip-from-y-to-z/6387
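For context, a minimal sketch of the sort of configuration involved (the nameserver address and FQDN below are placeholders; only the resolvers/backend/server names come from the log above):

resolvers k8s_dns
    nameserver dns 10.96.0.10:53
    hold valid 10s

backend account_service
    # the FQDN is re-resolved periodically; every address change is
    # currently reported as a runtime warning
    server account_service account-service.example.svc.cluster.local:80 resolvers k8s_dns check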

Output of haproxy -vv

HA-Proxy version 2.2.3-0e58a34 2020/09/08 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.2.3.html
Running on: Linux 5.15.18-200.fc35.aarch64 #1 SMP Sat Jan 29 12:44:33 UTC 2022 aarch64
Build options :
  TARGET  = linux-musl
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-stringop-overflow -Wno-cast-function-type -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1

Feature list : +EPOLL -KQUEUE +NETFILTER -PCRE -PCRE_JIT +PCRE2 +PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED -BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=4).
Built with OpenSSL version : OpenSSL 1.1.1g  21 Apr 2020
Running on OpenSSL version : OpenSSL 1.1.1g  21 Apr 2020
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.5
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.35 2020-05-09
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 9.3.0
Built with the Prometheus exporter as a service

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
            fcgi : mode=HTTP       side=BE        mux=FCGI
       <default> : mode=HTTP       side=FE|BE     mux=H1
              h2 : mode=HTTP       side=FE|BE     mux=H2
       <default> : mode=TCP        side=FE|BE     mux=PASS

Available services :
    prometheus-exporter

Available filters :
    [SPOE] spoe
    [COMP] compression
    [TRACE] trace
    [CACHE] cache
    [FCGI] fcgi-app
TimWolla commented 2 years ago

HA-Proxy version 2.2.3-0e58a34 2020/09/08 - https://haproxy.org/

@dekimsey It won't do anything about your feature request, but I'd like to note that HAProxy 2.2.3 is severely outdated: https://www.haproxy.org/bugs/bugs-2.2.3.html. The current 2.2.x release is 2.2.22 as of now. You should keep HAProxy up to date within your chosen branch.

dekimsey commented 2 years ago

@TimWolla Agreed and thank you for pointing that out!

wtarreau commented 2 years ago

I find it concerning that you're seeing a lot of them, because each time an IP address changes, it's not something without possible impact, which is the reason for these logs! What could be the cause of this? Having configured fewer servers than are advertised in DNS, maybe? I would hope that at least after a few DNS responses your backend is completely filled with working addresses that remain stable for the life of these servers. At least the settings in the resolvers section are made for this (I think it's mostly the hold parameter that's meant to be used for this).

dekimsey commented 2 years ago

Thank you @wtarreau. I think the issue is that HAProxy expects stable IPs for backend servers, while some systems rely on DNS to load-balance and therefore round-robin their IPs, which is entirely normal for them. In particular, this affects our S3 backends, where we are using HAProxy in front of some of our S3 resources.

The S3 DNS record is a single A record with a 300s TTL. Each lookup returns a different, single A record. I could set a hold value equal to the TTL, but it would still log the change every 300s, and at that point I'm willfully ignoring the record TTL. Perhaps one could argue that the hold behavior could have a "ttl" special value that holds valid responses until the TTL expires, maybe even pre-querying anew at the 90% mark.

To be honest, I think this is a situation better suited to server-template. But it too would spam endlessly about the IP changing, so I'm stuck with no good options.

In an ideal situation, I think I would: 1) use server-template, as it's intended to handle a variable number of servers; 2) perhaps have server-template use a server for the duration of the record TTL; 3) have something like resolve-opts nolog-valid.
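As a rough illustration of (1) with today's keywords (the bucket name and resolvers section name below are made up, and the nolog-valid option from (3) does not exist):

backend s3_service
    mode http
    # up to 4 server slots get filled from the DNS answers for the bucket endpoint
    server-template s3 4 mybucket.s3.amazonaws.com:80 resolvers mydns resolve-prefer ipv4 check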

wtarreau commented 2 years ago

But normally (maybe it's the hold parameter, but I'm not sure, CCing @bedis for this), even if you have only one server and the S3 DNS announces multiple servers in round-robin, as long as your IP address appears often enough it will not change. Alternatively, you could declare a few servers and use the one that matches the last announcement.

tmccombs commented 2 years ago

it's not something without possible impact

What is the impact with regards to haproxy?

I have a similar problem, also with S3. As mentioned, DNS for S3 returns a single A record that changes. But it isn't actually doing round-robin between a small number of IP addresses; it seems to be picking a random IP from a large pool of IP addresses, or at least something similar. What's more, the TTL on these records is very small (looking at some sample queries, the TTL is between 1 and 5 seconds in the DNS response).

Looking at the logs for a single backend using S3 over a 5-minute period, after HAProxy had already been running for a while, I got this warning 175 times (a little more than once every 2 seconds), with 152 unique IP addresses.

Granted, the behavior of s3's DNS is pretty unusual. However, I imagine quite a few people use s3 as a backend. I'd really like a way to avoid spamming the logs with these warnings.

I can somewhat mitigate the volume of these logs by setting hold valid and timeout resolve. But I still get the logs, and I'm hesitant to increase the hold value too much in case an IP address becomes unavailable.
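Roughly, that mitigation looks like this (the section name, nameserver address and durations below are just what I'd try, not a recommendation):

resolvers aws_dns
    nameserver vpc 169.254.169.253:53
    # keep the last valid answer well beyond the 1-5s record TTL
    hold valid 30s
    # re-resolve every 30s instead of the much shorter default
    timeout resolve 30s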

Are there any other workarounds?

P.S. I'm actually kind of curious why I get this warning so often with the default hold and timeout settings. According to the documentation, the default hold valid time is 10 seconds, so I would expect a valid response to be kept for 10 seconds, yet the address changes every 1 to 2 seconds. Maybe I'm misunderstanding what hold does?

markonen commented 2 years ago

I'm also seeing this with S3 backends specifically, their 5-second TTLs really make it pop. Perhaps the notice level would be appropriate for this and then maybe no configuration options would be needed?

jiang-gao commented 1 year ago

The same logging issue happens with all my S3 backends as well.

markonen commented 1 year ago

We're getting millions of lines of this log spew each day on our edge clusters, which have a number of S3 backends. I can think of workarounds and, of course, we could filter these out downstream, but there really seems to be no ill effect from the IP address change, so I'd much prefer a way to just silence this. Any thoughts on making this a notice?

ethanmdavidson commented 8 months ago

This is also an issue when putting HAProxy in front of a Netlify site.

I understand that an IP change is "not without possible impact", but it's the expected behavior of a third-party system that I have no control over. Printing this line repeatedly creates a lot of noise in the logs, and makes it harder for me to see other events that are more likely to have possible impact.

My preferred solution would be an option to disable this message on a resolver, server, or backend (IMO the resolver makes the most sense).

Logging to a less-severe level would also be acceptable, though I don't like this solution as much since I have only a few backends that are expected to change their IPs, and I would prefer to continue being notified if one of the others unexpectedly changes its IP.

havedill commented 3 months ago

I'm randomly seeing this with a Consul resolver. Instead of creating 2 servers in the backend (which I have it configured to do), it is rotating the IP on a single server. I'm on HAProxy Enterprise 2.6.

I don't recall this happening when I originally configured my backends.

Jun 29 06:51:23 hostname hapee-lb[30434]: excel_simulator/DisplayServer1 changed its IP from 10.13.155.38 to 10.13.155.104 by DNS additional record.
Jun 29 06:51:27 hostname hapee-lb[30434]: excel_simulator/DisplayServer1 changed its IP from 10.13.155.104 to 10.13.155.38 by DNS additional record.
Jun 29 06:51:30 hostname hapee-lb[30434]: excel_simulator/DisplayServer1 changed its IP from 10.13.155.38 to 10.13.155.104 by DNS additional record.
Jun 29 06:51:32 hostname hapee-lb[30434]: excel_simulator/DisplayServer1 changed its IP from 10.13.155.104 to 10.13.155.38 by DNS additional record.
Jun 29 06:51:35 hostname hapee-lb[30434]: excel_simulator/DisplayServer1 changed its IP from 10.13.155.38 to 10.13.155.104 by DNS additional record.
Jun 29 06:51:37 hostname hapee-lb[30434]: excel_simulator/DisplayServer1 changed its IP from 10.13.155.104 to 10.13.155.38 by DNS additional record.
backend excel_simulator
    balance leastconn
    mode http

    server-template DisplayServer 2 _DisplayServer._simulator.service.consul resolvers dev-consul resolve-opts allow-dup-ip resolve-prefer ipv4 check init-addr none
resolvers dev-consul
    nameserver consul1 10.13.157.15:8600
    nameserver consul2 10.13.157.57:8600
    nameserver consul3 10.13.157.36:8600
    accepted_payload_size 8192
    hold valid 5s

There should be 2 servers, not one.

Darlelet commented 3 months ago

@havedill are you sure you are not looking at an older haproxy process which uses the wrong configuration? Even if the resolver doesn't provide an IP address, server-template should initialize the proper number of servers upon startup (2, according to your configuration).

But since it looks like this is a different problem from the one described in this issue, please open a distinct issue if the problem persists.

havedill commented 3 months ago

OK, thanks. I'll message my enterprise support guys and see what they think.

tmccombs commented 3 months ago

@havedill does your DNS server rotate the order of responses, or return the records in a random order?

I think haproxy tries to keep the assignment of ip addresses consistent, but maybe there is a bug there?

havedill commented 3 months ago

I'm fairly certain it's related to this https://github.com/hashicorp/consul/issues/21325

I have my dev cluster on the newest Consul, which now resolves everything to the exact same name instead of unique names in the SRV output.

Edit:

My issue is resolved. Adding experiments = [ "v1dns" ] to my consul.hcl as a temporary workaround allowed HAProxy to see distinct DNS entries again; this will be fixed in Consul 1.19.1. I see all my backends again.

NicoAdrian commented 1 day ago

HAProxy 2.2.31 here. I'm getting flooded by these logs too. Even setting the log level to err (log 127.0.0.1:514 local6 err) doesn't suppress them!

NicoAdrian commented 13 hours ago

HAProxy 2.2.31 here. I'm getting flooded by these logs too. Even setting the log level to err (log 127.0.0.1:514 local6 err) doesn't suppress them!

Even with no log, they still appear :(

Darlelet commented 9 hours ago

@NicoAdrian yes indeed, they are reported with ha_warning() in addition to being sent to the logs, on purpose it seems. Thus they get printed on stderr regardless of the log settings.

It was implemented here: 14e4014a485860892933e7c9ce0fb3c53c659e99

NicoAdrian commented 9 hours ago

@NicoAdrian yes indeed, they are reported with ha_warning() in addition to being sent to the logs, on purpose it seems. Thus they get printed on stderr regardless of the log settings.

It was implemented here: 14e4014

Thanks. So there's no way to prevent HAProxy from logging that?

Darlelet commented 9 hours ago

Unfortunately no, unless some condition is added in the code to prevent it from being reported as a warning.

NicoAdrian commented 9 hours ago

Unfortunately no, unless some condition is added in the code to prevent it from being reported as a warning.

Clear, thanks.

capflam commented 9 hours ago

Well, from my point of view, this call to ha_warning() during runtime is definitely a bad idea. We must avoid this kind of message once the startup stage has finished. I guess it could also be useful to have a way to disable the log messages too, but that is probably harder to achieve. At the very least, we can remove the warning during runtime.

NicoAdrian commented 9 hours ago

Well, from my point of view, this call to ha_warning() during runtime is definitely a bad idea. We must avoid this kind of message once the startup stage has finished. I guess it could also be useful to have a way to disable the log messages too, but that is probably harder to achieve. At the very least, we can remove the warning during runtime.

Yeah, that would be great. In my case, my backend is not S3 but Akamai edges, and I set the resolve timeout to 20s. So every 20s, for each of my backends (dozens of them), I get those logs.

Darlelet commented 9 hours ago

@NicoAdrian you mean even if the IP address doesn't change? Edit: I guess that, like S3, Akamai also rotates DNS?

NicoAdrian commented 9 hours ago

@NicoAdrian you mean even if the IP address doesn't change?

They do change, every cycle (every 20s in my case). HAProxy does a DNS A query and Akamai answers with different addresses each time. Yes, Akamai rotates a lot.

wtarreau commented 9 hours ago

I continue to think this discussion is addressing the wrong issue. A server changing its IP address is always a serious enough event to warrant at least a log; it's at least as serious as a server going down. I mean, when a server changes address and you still have sessions on the old one, how will you know from your logs, stats or state dumps which request goes to which server? How do you even know whether the previous incarnation of the server still works, which one is receiving health checks, etc.?

Why does the server need to change its address in the first place? As much as I dislike DNS-based service discovery for its total flakiness, it was designed to resist cases where an address is not advertised for a while due to round-robin responses and other such well-known cases. So I'm still wondering what causes the server to change address that often. Do you have enough servers configured in the backend, for example?

NicoAdrian commented 9 hours ago

I continue to think this discussion is addressing the wrong issue. A server changing its IP address is always a serious enough event to warrant at least a log; it's at least as serious as a server going down. I mean, when a server changes address and you still have sessions on the old one, how will you know from your logs, stats or state dumps which request goes to which server? How do you even know whether the previous incarnation of the server still works, which one is receiving health checks, etc.?

Why does the server need to change its address in the first place? As much as I dislike DNS-based service discovery for its total flakiness, it was designed to resist cases where an address is not advertised for a while due to round-robin responses and other such well-known cases. So I'm still wondering what causes the server to change address that often. Do you have enough servers configured in the backend, for example?

Well, it really depends. Serious issue or not, I should be able to suppress those logs. In my case, Akamai does send me different edges every 20s. Fine. I don't care, I don't know why they do that, and I have no power to change it. Their TTL is 20s, which is why I configured that in my HAProxy config.

wtarreau commented 8 hours ago

But do they send different ones or are they rotating among a pool, which is completely different ?

Because if they're rotating among a pool, the correct solution is to configure a few servers (or just have a server-template statement) and they'll be distributed to available servers, expiring the oldest ones.

Changing the address all the time is bad for plenty of things, including resource usage, ability to reuse existing connections etc. Closing and re-opening SSL connections for example is a waste of CPU and an increase of latency.

I can understand the intent to follow a constantly moving IP address (in this case there's no load balancing; it's just a hack for whatever service keeps moving), but I first want to be sure that it's not a misunderstanding. For example, there was another user above showing constantly alternating addresses, which was just the result of the DNS advertising those addresses in turn (or even together) and not having enough servers configured to learn them all, degrading the quality of service.

Warnings are not there to annoy you. They're there so that haproxy can tell you: "Nico, the environment you deployed me in does not seem to accurately match the config you gave me; I will work in a suboptimal way and your users might face a degraded experience every time you see this message. If it bothers you, please first make sure that it's intended and not an accident." That's why I'm still insisting.

Otherwise, if you're absolutely certain you don't care at all about sending traffic to random servers, then you can just add quiet to your global section and all warnings and alerts will magically be gone. It's just that I consider production servers should work properly.
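For reference, that last resort is literally a single keyword (and again, it hides every warning and alert, not just these):

global
    # per the remark above, this silences all warnings/alerts on the process
    # output, not only the "changed its IP" messages
    quiet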

ethanmdavidson commented 8 hours ago

A server changing its IP address is always a serious enough event to warrant at least a log

I agree with NicoAdrian, it depends. For most of my backends, I'm in control of their networking, and I expect their IPs not to change because that's how I've set things up. Other backends are not under my control, and can change their IP whenever they want.

In these cases, the backend isn't forwarding requests to a specific server, it's forwarding requests to a service, and I have no visibility into how that service handles the request. I don't know if they are rotating among a pool, and if I do know, it could change at any time without warning. For these services, I don't know/care which request goes to which server. There are rare cases where I will need to know the IP that haproxy is forwarding a request to, e.g. when I'm troubleshooting something, but in normal operation these log messages are just spam that adds noise to my logs.

Warnings are not there to annoy you. They're there so that haproxy can tell you: "Nico, the environment you deployed me in does not seem to accurately match the config you gave me; I will work in a suboptimal way and your users might face a degraded experience every time you see this message. If it bothers you, please first make sure that it's intended and not an accident." That's why I'm still insisting.

I totally agree that the warning should be on by default. But for the cases where I know this is happening, can't change it, and accept that it is suboptimal, I need a way to silence it. Putting quiet in the global section is not desirable because I still want to be notified about other problems as they come up. I have one backend where this message is constantly spammed, and I ignore it. If any of my other backends printed this message, it would be a big deal! But I would never see it, because it's drowned out in the spam. This is why there needs to be an option on the server, backend, or resolver, IMO.

wtarreau commented 8 hours ago

Then if you're using it as a service instead of a server, this is where the problem stems from. You shouldn't be using a server in load-balancing form with its automatic address resolution, health checks, weights, idle pools etc which are all reset upon every change, it would be way cleaner to just resolve on the fly when the request arrives and forward the request to the IP address that results from the resolution.

For example, I think that doing something approximately like this (adapted from the config manual) would be much cleaner (and wouldn't report address changes, since no server address ever changes):

resolvers mydns
    nameserver dns1 1.1.1.1:53
    nameserver dns8 8.8.8.8:53
    timeout retry 1s
    hold valid    10s
    hold nx       3s
    hold other    3s
    hold obsolete 0s

frontend frt
    bind ...
    use_backend bck if { path /blah }

backend bck
    # dedicated to service.provider.tld
    http-request do-resolve(txn.svcip,mydns,ipv4) str(service.provider.tld)
    http-request set-dst var(txn.svcip)
    server svc 0.0.0.0:80  # will go to the address chosen by set-dst above

It could also make sense to add the IP address used and the source port to the logs, so as to help troubleshoot connectivity issues (correlation with tcpdump, firewalls, etc.).

tmccombs commented 7 hours ago

You shouldn't be using a server in load-balancing form with its automatic address resolution, health checks, weights, idle pools etc which are all reset upon every change

I have a case where I have two s3 buckets in two different regions.

I want to prefer the bucket that is closer, but if health checks fail for that bucket, fall back to the bucket in the other region. I don't really need load balancing, but I do want a fallback.
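Something like this is the shape I have in mind (the bucket hostnames and resolvers section name are placeholders):

backend s3_failover
    mode http
    # the primary is the closer region; the backup only receives traffic
    # once the primary's health checks fail
    server s3_near mybucket.s3.us-west-2.amazonaws.com:80 resolvers mydns resolve-prefer ipv4 check
    server s3_far  mybucket.s3.us-east-1.amazonaws.com:80 resolvers mydns resolve-prefer ipv4 check backup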

I'd also like to be able to re-use connections between haproxy and backend servers.

And does do-resolve use a dns cache? It isn't really clear from the documentation.

Another possible use case for having health checks is to detect that a service has become unacceptably slow and then short-circuit requests with a 503 until it becomes healthy again. Although maybe that could be accomplished with do-resolve and a gpc or something.

But do they send different ones or are they rotating among a pool, which is completely different ?

Because if they're rotating among a pool, the correct solution is to configure a few servers (or just have a server-template statement) and they'll be distributed to available servers, expiring the oldest ones.

For S3, I think the way it works is that it effectively selects an IP address from a very large pool, probably thousands of possible IP addresses.

wtarreau commented 7 hours ago

Regarding checks, you realize that your checks will not even all go to the same server, and that the resulting status will be a combination of various checks sent to different servers ? I'm sorry, but all I'm reading sounds totally disgusting. We're trying to design a component that focuses on reliability and observability to help with troubleshooting and all I'm reading here is "well, let's send requests wherever, if they find their way it's probably not that bad after all".

I understand that there is probably a use case and real needs behind this, and I am not rejecting them. I'm just trying to gauge what exactly is needed so that we can figure out what has to be adapted (or even added) to handle this case correctly. For now I've only read about absolutely atrocious hacks, I'm sorry :-(

tmccombs commented 5 hours ago

Regarding checks, you realize that your checks will not even all go to the same server, and that the resulting status will be a combination of various checks sent to different servers ?

Sure. But I don't care about individual servers. I care about the service as a whole (for a specific region). And for S3 at least, I don't think there is a 1:1 correspondence between IP addresses and physical servers anyway.

But my bigger questions are if I use do-resolve:

wtarreau commented 1 hour ago

Regarding do-resolve, it uses the resolvers sections, hence the cache there (though I cannot be very specific as I'm not deep into the DNS stuff; some tests might be needed to verify which parameter exactly acts on this). I believe the hold obsolete 0s above precisely disables caching. Connection reuse will work based on the (source, destination, SNI) tuple. Right now, reuse breaks with the LB every time the DNS replaces a server's address.

Regarding the service vs. servers, that's exactly what I'm understanding from your use case as well, and I'm thinking that maybe we should have a way to deal with DNS advertisements (and LB) differently. We could, for example, have an option indicating that servers should not be seen as individual ones but as part of a pool delivering a service. In this case, we could enable a few servers in a backend and make sure that DNS records are learned in rotation (always replacing only the oldest server), start them down (via init-state down) and wait for health checks to succeed before sending traffic there, and have a new balance latest algorithm that would only send traffic to the latest valid server, so as to gracefully replace previous ones. This would ensure that newly learned servers are used as soon as possible, without needlessly making servers go up and down, and would likely improve connection reuse. Some changes might be needed, such as simply refreshing a server that already has the advertised address, etc. We would see them as a stack, of sorts.

That's probably the way DNS should be used for such services, instead of doing load balancing and trying to hide the dust under the carpet. I don't know if that makes sense to you as well, based on your use case.