pranikum closed this issue 7 months ago
Is this issue already fixed as part of 2.2.25, or is this a new issue?
It looks like an AB/BA locking condition in the resolvers code (i.e. one code path holds a lock and waits for the second one to be released while the other code path does the opposite). It doesn't remind me of anything specific, but the resolvers code is tentacular and a small number of fixes were applied between 2.2.24 and 2.2.25; in any case you should update, at least to benefit from the latest fixes.
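For readers unfamiliar with the term, here is a minimal, hypothetical illustration of an AB/BA deadlock in C with pthreads. This is not HAProxy code; the lock names are only stand-ins for the two locks discussed in this thread.

```c
/* Hypothetical AB/BA deadlock sketch, NOT HAProxy code:
 * thread 1 locks A then B, thread 2 locks B then A. If each grabs its
 * first lock before the other releases, both wait forever. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* think: resolvers lock */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* think: server lock */

static void *path_one(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock_a);   /* holds A ... */
    pthread_mutex_lock(&lock_b);   /* ... and waits for B */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

static void *path_two(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock_b);   /* holds B ... */
    pthread_mutex_lock(&lock_a);   /* ... and waits for A: deadlock */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, path_one, NULL);
    pthread_create(&t2, NULL, path_two, NULL);
    /* With unlucky timing these joins never return: each thread owns the
     * lock the other one is waiting for. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("no deadlock this time");
    return 0;
}
```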
While there are some architectural changes between the two versions, could it be a similar contention issue around DNS response handling in both 2.2 and 2.4?
See: https://github.com/haproxy/haproxy/issues/1952
In 2.4 (#1952): resolv_process_responses is holding the resolvers lock for too long, probably because processing many packets of the same batch prevents recv from returning EAGAIN for a long time, thus staying in the loop? According to the gdb trace, the stuck thread is in dns_recv_nameserver() while the others are waiting on the resolvers lock.
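To make the suspected pattern concrete, here is a simplified sketch (hypothetical names, not the actual HAProxy code) of a drain loop that keeps a lock held until recvfrom() finally returns EAGAIN, so a large backlog of DNS responses keeps every other thread waiting on that lock for the whole batch:

```c
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/socket.h>

struct resolvers { pthread_mutex_t lock; int fd; };

static void process_response(const char *buf, ssize_t len) { (void)buf; (void)len; }

void drain_responses(struct resolvers *r)
{
    char buf[512];
    ssize_t len;

    pthread_mutex_lock(&r->lock);                /* taken once for the whole batch */
    for (;;) {
        len = recvfrom(r->fd, buf, sizeof(buf), 0, NULL, NULL);
        if (len < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;                           /* backlog finally drained */
            break;                               /* other error: also leave the loop */
        }
        process_response(buf, len);              /* may be slow; lock still held */
    }
    pthread_mutex_unlock(&r->lock);              /* only released here */
}
```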
In 2.2: dns_resolve_recv() (the legacy name of resolv_process_responses) also seems to hold the resolvers lock for too long. In @pranikum's trace though, according to gdb the stuck thread (thread 2) is waiting on the server lock at line 2310 while holding the resolvers lock, supporting @wtarreau's hypothesis.
But based on the trace and a quick code inspection, I couldn't find where this lock (the server lock) could be held, either by thread 2 (itself) or by thread 1 (currently stuck at the very beginning of dns_process_resolvers).
It's not the same problem because, as Christopher explained to me, these are two different locks held by the two threads, so by definition it cannot be a matter of time; it simply means these threads are definitely blocked. And since there are no other threads, it implies that each lock is already held by the other thread, hence the deadlock.
Regarding the fact that you didn't find where the locks are taken, I don't know either, and it could indeed be that, due to a bug, another one was not released. Note that there's still a bug in 2.2.24 where a removal from the ebtree is missing, which can lead to a use-after-free, and possibly to such a case as well indirectly. And I agree that it's theoretically possible that both are only stuck in a loop and were caught just at the moment they were taking a lock, except that if they're alone on the lock, it only takes one instruction and is very fast, so the probability of landing on such an instruction without contention is almost zero. And here we got two spinlocks at once. Of course, looking at more traces from other crashes could confirm or refute this, but I'd say that the one-in-a-million chance of catching a thread on a non-contended lock is sufficiently rare not to take it as a first hypothesis here.
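For background only, the usual remedies for this class of AB/BA deadlock are to impose a single global acquisition order for the two locks, or to use trylock on the second lock and back off instead of waiting. A minimal sketch with hypothetical names (not a patch for HAProxy):

```c
#include <pthread.h>

static pthread_mutex_t resolvers_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t server_lock    = PTHREAD_MUTEX_INITIALIZER;

/* Option 1: fixed global order, resolvers lock always before server lock. */
void update_server_from_resolver(void)
{
    pthread_mutex_lock(&resolvers_lock);
    pthread_mutex_lock(&server_lock);
    /* ... touch both structures safely ... */
    pthread_mutex_unlock(&server_lock);
    pthread_mutex_unlock(&resolvers_lock);
}

/* Option 2: when the natural order is reversed, back off instead of waiting
 * so the other thread can make progress; the caller retries later. */
int update_resolver_from_server(void)
{
    pthread_mutex_lock(&server_lock);
    if (pthread_mutex_trylock(&resolvers_lock) != 0) {
        pthread_mutex_unlock(&server_lock);   /* release and report "retry" */
        return -1;
    }
    /* ... touch both structures safely ... */
    pthread_mutex_unlock(&resolvers_lock);
    pthread_mutex_unlock(&server_lock);
    return 0;
}
```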
@wtarreau ... Yes, there is a plan to update to 2.2.25. However, we wanted to be sure whether this issue was already fixed. It looks like it's still there. Will try to collect more dumps and add them to the thread.
The response is: we don't know. Several stability bugs affecting the resolvers area were addressed between .24 and .25, and this code is extremely sensitive; what you fix often doesn't have an immediate relation to what you observe, so it's really impossible to say whether you're experiencing one of the possible side effects of these bugs. All I can say is that some nasty bugs were fixed there, and another one was even fixed after .25 (and will be in the next .26 once released). If you're building yourself, you could even be interested in taking the latest 2.2-stable, which contains all fixes that we're aware of for 2.2 and that will become 2.2.26.
2.2.27 was released. You should try it.
Any news about this issue?
We are still in the process of moving to the latest version... We have not seen this issue very frequently on 2.2.24. Will update once we have completed the migration.
Copy that. Thanks!
@pranikum, it's me again. Did you get a chance to test a newer version?
Yes, we have tested with the newer version and have started to migrate to it. So far we have not seen the issue with it. Will keep monitoring.
I'm closing this now, it seems fixed. At least it has not been reported by anyone in a while :) Thanks!
Detailed Description of the Problem
We see random SIGABRTs in HAProxy 2.2.24 during dns_process_resolvers. We have HAProxy running on various nodes; on a few of them we see HAProxy going down randomly. We enabled core dumps for the HAProxy process and are sharing the backtrace below.
Expected Behavior
HAProxy should not go down.
Steps to Reproduce the Behavior
NA
Do you have any idea what may have caused this?
No response
Do you have an idea how to solve the issue?
No response
What is your configuration?
Some values are removed.
Output of haproxy -vv
Last Outputs and Backtraces
Additional Information
No response