Closed maximshd closed 4 years ago
Thanks for the bug report. This is related to how haproxy handles SRV records starting with 2.2.
When a name is already known, we only care about weight changes, which is obviously a problem when an address changes. https://github.com/haproxy/haproxy/blob/master/src/dns.c#L578
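To make the suspected control flow concrete, here is a Python sketch of the behaviour described (hypothetical names and a deliberate simplification, not HAProxy's actual C code): when the SRV target name is already known, the loop moves on before any address comparison, so an address change is silently dropped while a weight change still applies.

```python
# Hypothetical sketch of the suspected dns.c logic: a known SRV target
# short-circuits the loop, so address changes are never examined.

def apply_srv_answers(servers, answers):
    """servers: {target_name: {"addr": ip, "weight": w}}
    answers:  list of (target_name, ip, weight) tuples from the SRV response."""
    for name, addr, weight in answers:
        srv = servers.get(name)
        if srv:
            # Only the weight is reconciled for a known name...
            srv["weight"] = weight
            # ...then we skip the rest, mirroring the reported
            # `if (srv) continue;` at lines 603-604: the address
            # carried by the answer is never compared or stored.
            continue
        servers[name] = {"addr": addr, "weight": weight}
    return servers

servers = {"gateway3": {"addr": "10.200.11.50", "weight": 10}}
apply_srv_answers(servers, [("gateway3", "10.200.1.168", 20)])
print(servers["gateway3"])  # the address stays stale at 10.200.11.50
```

Running the sketch shows the weight moving to 20 while the stale address survives, which matches the symptom in the report.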
You mean that lines 603 and 604 make no sense, if I understand right? That sounds plausible indeed. I really have no idea how SRV records work in detail, so I can't judge whether it's normal to only process the weight, but my intuition tells me it sounds fishy; at the very least a comment explaining why would be needed.
@maximshd could you please try to comment out or remove these two lines from dns.c?

```c
603         if (srv)
604             continue;
```
Be careful, don't do that on your prod if you only have one server, as I'm not certain of any possible side effects. It's just to validate Jerome's idea.
I'm not sure what the problem is but when I tried debugging/fixing it earlier I found this part of the code to be likely to be close to the problem.
I've tried your suggestion of removing lines 603 and 604, but this leads to more servers being set, and has no effect on address changes being ignored.
Here, only 4 servers should be set, as I have 4 entries for my SRV record:
[WARNING] 246/220556 (416028) : in/a1 changed its IP from to 192.168.135.3 by DNS additional record.
[WARNING] 246/220556 (416028) : in/a2 changed its IP from to 192.168.135.2 by DNS additional record.
[WARNING] 246/220556 (416028) : in/a3 changed its IP from to 192.168.135.4 by DNS additional record.
[WARNING] 246/220556 (416028) : in/a4 changed its IP from to 192.168.135.1 by DNS additional record.
[WARNING] 246/220557 (416028) : in/a5 changed its IP from to 192.168.135.3 by DNS additional record.
[WARNING] 246/220557 (416028) : in/a6 changed its IP from to 192.168.135.2 by DNS additional record.
Captain obvious me thinks we should loop through all additional records, see if one matches the hostname currently associated with a given server, and update the address when needed.
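That suggestion could be sketched like this (Python pseudocode with hypothetical names, not HAProxy internals): walk the additional records, find the one whose target matches the hostname a server is bound to, and refresh the address when it differs.

```python
# Sketch of the proposed reconciliation: for each tracked server, look up
# its hostname in the additional section and update the address on mismatch.

def reconcile_addresses(servers, additional_records):
    """servers: {server_id: {"hostname": h, "addr": ip}}
    additional_records: {hostname: ip} parsed from the DNS additional section."""
    changed = []
    for sid, srv in servers.items():
        new_addr = additional_records.get(srv["hostname"])
        if new_addr is not None and new_addr != srv["addr"]:
            changed.append((sid, srv["addr"], new_addr))
            srv["addr"] = new_addr
    return changed

servers = {"in/a1": {"hostname": "gateway3.node.consul", "addr": "10.200.11.50"}}
records = {"gateway3.node.consul": "10.200.1.168"}
print(reconcile_addresses(servers, records))
# [('in/a1', '10.200.11.50', '10.200.1.168')]
```

This is only the matching step; the real fix also has to cope with targets that appear several times and with records that carry no additional A/AAAA entry at all.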
OK thanks. Let's CC @capflam in case he has any obvious idea on this given that he seems to have changed this code recently.
This sounds like the same issue as #793 with a fix that came in this mailing list posting here and commit 87138c3524bc4242dc48cfacba82d34504958e78 on the master branch.
I hope that an haproxy 2.2.3 release with this fix will come out soon?
It seems a bit more complicated. Without SRV records, there are complex searches of which addresses are in use to make sure the advertised addresses are properly reassigned regardless of their order, and it doesn't seem that this is done as well for SRV records when they're received at once, so that could be the explanation of some of the issues. Christopher has started to look into this but it seems more like a limitation of the current design than just a bug to fix, so it might take a bit more time than expected.
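The "complex searches" for the non-SRV path can be illustrated with a small Python sketch (an assumption-laden simplification, not the actual dns.c algorithm): addresses that are still advertised stay pinned to the server that already holds them, and only the leftover servers pick up the leftover addresses, so the result does not depend on the order of the response.

```python
# Sketch of order-independent address reassignment: keep matches, then
# hand the remaining advertised addresses to servers that lost theirs.

def reassign(current, advertised):
    """current: {server_id: ip or None}; advertised: list of resolved ips."""
    remaining = list(advertised)
    # First pass: every server whose address is still advertised keeps it.
    for sid, ip in current.items():
        if ip in remaining:
            remaining.remove(ip)
    # Second pass: servers whose address disappeared get a leftover one.
    for sid, ip in current.items():
        if ip not in advertised:
            current[sid] = remaining.pop(0) if remaining else None
    return current

current = {"a1": "192.168.135.3", "a2": "192.168.135.9", "a3": None}
print(reassign(current, ["192.168.135.3", "192.168.135.2", "192.168.135.4"]))
# a1 keeps .3; a2 and a3 receive .2 and .4
```

The point Willy makes is that nothing equivalent runs when a whole batch of SRV records arrives at once, which is why this reads as a design limitation rather than a one-line bug.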
Just a small update. I think I fixed the bug. @jmagnin validated my patch. I must clean it up before pushing it. But it should be ok soon.
I pushed a fix. I hope it does not break anything else; I'm not a DNS expert and this part of HAProxy is pretty fuzzy for me. Because @jmagnin already validated it, I consider the issue fixed and I will do the backport soon.
I think we can issue a 2.2 with it soon. Fortunately that one didn't touch older versions so the risk is limited here.
backported where desired, now closing.
Detailed description of the problem
Recently we upgraded our haproxy version to 2.2.2 and observed pretty strange behavior while working with Consul DNS service discovery: some backends keep stale or cached IP addresses after backend scale-down/scale-up events.
In our setup we had 6 servers initially. Then 2 servers went down because of a scale-down event; as a result, haproxy put them in MAINT state for a while, which is expected behavior. But a later scale-up spun up 2 new instances, and these new instances were not able to reach the UP state because of a failed httpchk.
After checking haproxy stats, we found that the backend server `gateway3` from the log above has the IP of the old server that was registered under the `gateway3` name a few hours before. Backend IP from haproxy stats: `10.200.11.50`. Real server IP according to the Consul reply: `10.200.1.168`. The backends/haproxy cannot recover from such a state by themselves; only a restart helps.
Expected behavior
Use the correct, up-to-date IP address for communication with backends.
Steps to reproduce the behavior
Do you have any idea what may have caused this?
No
Do you have an idea how to solve the issue?
Checking
What is your configuration?
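(The reporter's actual configuration is not quoted in the thread. As a point of reference, a minimal setup of the kind being discussed would look roughly like the sketch below; the Consul address, service name, and health-check path are all placeholders.)

```
resolvers consul
    nameserver consul1 127.0.0.1:8600
    accepted_payload_size 8192

backend gateways
    balance roundrobin
    option httpchk GET /health
    server-template gateway 6 _gateway._tcp.service.consul resolvers consul resolve-prefer ipv4 check
```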
Output of `haproxy -vv` and `uname -a`
Additional information (if helpful)