Can you try reverting that commit on a local fork and see if it resolves the issue?
Yes, I will run the below in production and report back; I may wait until next week, depending on whether it resolves the issue, just to be sure.
Great, thanks! If it fixes the problem, please open a PR!
Thanks @tarcieri! It would be easy enough for me to open a PR with my linked branch, but I think it's appropriate to tag the original author @midnight-wonderer before we wipe out his valued contribution in the next release without warning.
@midnight-wonderer: I believe the retry loop implemented in this feature is responsible for consuming the thread pool in my BG workers. I'm unable to produce a failing test script, and I don't fully understand the conditions or low-level details of why this retry loop hangs indefinitely, but the results in my application have made it pretty clear that this commit is the culprit.
I'm happy to debug or provide details as requested, I just need some hints or a nudge in the right direction as I don't have the time or resources to wrestle such a complex, intermittent bug.
Looking into it, getting back to you in a day (or two).
I know you understand the code already, but I'd argue that the change in the PR is an abstract one.
The change describes high-level policy for how the software should behave without specifying implementation details. The added code does not specify how to resolve the address (that definition moved to the constructor), how to interact with the OS, how to handle the network stack, or anything thread-related.
Apart from the policy level, there is no change from the existing code. While the issue surfaced at this commit, the real culprit might be elsewhere.
This
def connect(socket_class, host, port, nodelay = false)
  reset_timer
  ::Timeout.timeout(@time_left, TimeoutError) do
    @socket = socket_class.open(host, port)
    @socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1) if nodelay
  end
  log_time
end
is the same as this
def connect(socket_class, host, port, nodelay = false)
  reset_timer
  connect_operation = lambda do |host_address|
    ::Timeout.timeout(@time_left, TimeoutError) do
      super(socket_class, host_address, *args)
    end
  end
  connect_operation.call(host)
  log_time
end
no behavior change.
And this
host_addresses = @dns_resolver.call(host_name)
# ensure something to iterate over
trying_targets = host_addresses.empty? ? [host_name] : host_addresses
reset_timer
trying_iterator = trying_targets.lazy
error = nil
begin
  connect_operation.call(trying_iterator.next).tap do
    log_time
  end
rescue TimeoutError => e
  error = e
  retry
rescue ::StopIteration
  raise error
end
is just the translation of this policy
all_possible_addresses.each do |addr|
  connect_operation.call(addr)
end
implemented purely with language primitives without libraries or native extensions.
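To see the mechanics in isolation, here is a standalone sketch of the lazy-iterator/retry/StopIteration pattern, using made-up addresses and a stand-in "connect" that always times out (this is not the httprb code itself):

require "timeout"

addresses = %w[192.0.2.1 192.0.2.2 192.0.2.3]
iterator  = addresses.lazy
error     = nil

begin
  target = iterator.next                                   # raises StopIteration once exhausted
  raise Timeout::Error, "connect to #{target} timed out"   # stand-in for a timed-out connect
rescue Timeout::Error => e
  error = e
  retry                                                    # re-enter the block; `next` yields the following address
rescue StopIteration
  raise error                                              # every address was tried; re-raise the last failure
end
# => eventually raises "connect to 192.0.2.3 timed out"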
IMHO, if we want to move forward, we should keep the policies and fix things at the lower level. Depending on the objective, if we want to make this regression go away right now and drop the progress, we can simply revert the change. That option might break other people who rely on the behavior, though.
From this piece of code, it is either Timeout.timeout or TCPSocket.open that created the "dead threads".
My guess is that the issue stems from TCPSocket not supporting timeouts.
Should we use Socket#connect_nonblock instead? Or rather, are we able to?
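For what it's worth, here is a minimal sketch of what a connect_nonblock-based connect with a real timeout might look like (not httprb code; the host, port, and raised error are illustrative):

require "socket"
require "io/wait"

def connect_with_timeout(host, port, timeout)
  # NOTE: getaddrinfo itself can still block; only the connect is bounded here.
  address  = Socket.getaddrinfo(host, nil, :INET).first[3]
  sockaddr = Socket.sockaddr_in(port, address)
  socket   = Socket.new(:INET, :STREAM)

  begin
    socket.connect_nonblock(sockaddr)
  rescue IO::WaitWritable
    # Connect is in progress; wait until the socket is writable or we time out.
    unless socket.wait_writable(timeout)
      socket.close
      raise Errno::ETIMEDOUT
    end
    begin
      socket.connect_nonblock(sockaddr) # check the outcome of the in-progress connect
    rescue Errno::EISCONN
      # already connected, nothing to do
    end
  end

  socket
end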
However, if the reversal works for you guys, just giving me a heads up is enough. If, in the end, resolving the regression as fast as possible is the top priority, I have no objection to it.
I got some clues, but can't really confirm. I recall from somewhere that Enumerator can interfere with threads somehow; it was probably mentioned in one of @ioquatix's talks.
Can you get rid of lazy and try again?
Probably with something like:
trying_targets = host_addresses.empty? ? [host_name] : host_addresses
reset_timer
error = nil
trying_targets.each do |target|
  connect_operation.call(target).tap do
    log_time
  end
  error = nil
  break
rescue TimeoutError => e
  error = e
  next
end
raise error if error
@midnight-wonderer Thanks for the response and explanation! I agree with your assessment that your change is higher level and that the implementation details are at the core of this. I'm not comfortable continuing to use my organization's production environment as a testing ground to troubleshoot, at least until I have a firm grasp on what is happening at the lowest level and can reproduce the behavior locally.
My best guess at this point is that your change has compounded this underlying issue by making N + (Number of A records returned from Resolv.getaddresses(host_name)) connection attempts, as opposed to timing out once. My feeling is that timing out calls to potentially unresponsive services with TCPSocket.open is exhausting available client (outbound) ports, or causing ports that previously timed out, and were subsequently not closed properly, to be selected and then hang.
This is conjecture, and until I can repro locally, I feel I'm flying blind and spinning my wheels. You mentioned TCPSocket not being able to support timeouts, though I would point out that Ruby 3 does support this, and net/http now uses this option to handle connect timeouts, though of course http must support Ruby < 3, so that probably isn't helpful. FWIW, I think Timeout is the primary culprit here.
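For reference, a rough sketch of the Ruby 3 approach mentioned above (the host and port are placeholders; to my knowledge Socket.tcp has accepted connect_timeout: for a while, and TCPSocket.new gained it in Ruby 3.0):

require "socket"

# Raises Errno::ETIMEDOUT after 5 seconds without resorting to Timeout.timeout:
socket = Socket.tcp("api.example.com", 443, connect_timeout: 5)

# On Ruby 3.0+, TCPSocket.new accepts the same keyword:
socket = TCPSocket.new("api.example.com", 443, connect_timeout: 5)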
If you or @tarcieri can provide any ideas to get a reproducible script, I am eager to learn and help figure out a resolution for this issue. I am in no rush to merge or resolve if this problem is isolated to my usage; we've patched to remove this commit and are no longer experiencing issues.
Oh, OK, my bad; I thought you could reproduce it already since you can pinpoint and confirm that the issue will go away if we revert the commit.
I think only system programmers who know Ruby can explain exactly what happened. I believe Timeout.timeout's implementation is too arbitrary and does not look trustworthy, but I still don't understand why there was no problem before the commit.
Since system programming is not my area, I cannot assist any further than this.
But if I have some spare time, I'll try to see how I can help get rid of Timeout.timeout, at least for Ruby 3.
To clarify, I did revert your commit and kept everything else in place on 5.0.2, and we have not noticed the issue resurface in 48 hours now.
I see that https://github.com/httprb/http/pull/638 added some begin/rescue/retry blocks that have no retry count limits. So I guess the problem is that for some reason it starts raising and retrying forever in some situations.
So I guess the problem is that for some reason it starts raising and retrying forever in some situations.
It will only retry N + (Number of A records returned from Resolv.getaddresses(host_name)) times. After the final iteration, ::StopIteration raises and then it re-raises the last error that occurred.
As others have noted, Timeout is quite problematic and not particularly safe. Unfortunately it is the only way to get timeouts while leveraging the full functionality of the system resolver (i.e. libc's), because it's calling into a native library.
The main alternative is to use Resolv::DNS, which is written in Ruby and supports its own timeout mechanism via Resolv::DNS#timeouts; however, it doesn't support the full functionality of the system resolver.
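A minimal sketch of that alternative, assuming the stdlib Resolv::DNS API (the hostname is a placeholder):

require "resolv"

addresses = Resolv::DNS.open do |dns|
  dns.timeouts = 1                 # seconds per attempt; an array sets per-retry timeouts
  dns.getaddresses("example.com")  # pure-Ruby lookup, no Timeout.timeout needed
end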
@tarcieri My understanding is that timeouts are not occurring on DNS resolution, but on actually opening a TCPSocket to individual addresses? What am I missing?
Oh sorry, perhaps I was missing something?
Here's an old example I wrote of how to do non-blocking connect, but as noted above, it involves using Socket instead of TCPSocket:
https://github.com/socketry/socketry/blob/master/lib/socketry/tcp/socket.rb
Didn't #638 change the meaning of the timeout parameter during connection?
Before #638, DNS resolution took place within the scope of connection timeout block (via getaddrinfo during TCPSocket.new). Now it's taking place outside the scope of the connect timeout. After #638, if the hostname resolved to multiple addresses, and the timeout is reached, it will get restarted, and another attempt will be made to the next address.
So, for example, if we are trying to connect to a hostname which resolves to 8 addresses and our connect timeout is 30s, then the maximum runtime (if none of the hosts are accepting requests) has increased from 30s to 240s.
Should that have been a breaking change? Have I misunderstood anything here?
@jordanstephens It looks to me like you are right and this is how PerOperation works now. This doesn't explain why threads could be hanging forever, though.
Seems like we should just revert this for now.
@PhilCoggins mind opening a PR?
~I believe the combination of log_time and @time_left prevents the "resetting" of the timeout that @jordanstephens describes.~ Regardless, I can't get to the bottom of this, so I think it would be a wise decision to revert: https://github.com/httprb/http/pull/695.
Surely we can just rely on BGP to route traffic to proper addresses... right?
EDIT: It is correct, however, that DNS resolution takes place outside the scope of the timeout block... perhaps that is to blame here.
EDIT 2: Axed the above; if Timeout is raised, log_time would never be called, and thus the connect timeout is actually fresh for every address.
Didn't #638 change the meaning of the timeout parameter during connection?
Before #638, DNS resolution took place within the scope of connection timeout block (via getaddrinfo during TCPSocket.new). Now it's taking place outside the scope of the connect timeout. After #638, if the hostname resolved to multiple addresses, and the timeout is reached, it will get restarted, and another attempt will be made to the next address.
So, for example, if we are trying to connect to a hostname which resolves to 8 addresses and our connect timeout is 30s, then the maximum runtime (if none of the hosts are accepting requests) has increased from 30s to 240s.
Should that have been a breaking change? Have I misunderstood anything here?
Yep, you misunderstood something. The timeout in question is the "connection timeout"; it means a timeout per connection attempt.
If you have 3 IP addresses, the total timeout can be multiplied by 3. This is consistent with Nginx when used as a reverse proxy. I also discussed this very same topic with Daniel (the curl author) last year; he added options to separate the two (total timeout vs. per connection attempt).
If you set the connection timeout to 3 seconds and you have 3 IP addresses to connect to, it doesn't make sense to divide the connection timeout by 3 and allocate 1 second to each; the number of IP addresses is controlled by other entities, after all.
@midnight-wonderer but do you agree that after #638 the DNS lookup is done outside of timeout block? So if it hangs for some reason, it's going to hang forever?
I think @jordanstephens is right on the money; he mentioned the connection timeout multiple times. Also, I don't think anyone is saying it's right or wrong behavior, just that it's a breaking change that went out in a patch release and was not made obvious.
@midnight-wonderer but do you agree that after #638 the DNS lookup is done outside of timeout block? So if it hangs for some reason, it's going to hang forever?
I had actually overlooked this possibility; thanks for pointing it out. That could be a possible explanation for our hung threads.
Note: #695 shipped in v5.0.3.
@midnight-wonderer but do you agree that after #638 the DNS lookup is done outside of timeout block? So if it hangs for some reason, it's going to hang forever?
Sure, that is the breaking change.
My apologies for not realizing back then that, in this respect, it is a breaking change. However, back then, apart from the DNS resolution timeout, the commit just implemented previously unspecified behavior; hence I did not mention a breaking change.
There is probably no way to do it in a way everyone can agree upon. Some folks assume the timeout behavior based on the fact that the client won't attempt more than one connection; some people expect the connection timeout, based on the name, to cover only the period between TCP SYN and ACK.
I don't think anyone is saying it's right or wrong behavior.
Me neither; previously, I just addressed the part @jordanstephens brought up related to multiple timeouts. In that respect, the meaning of the term is unchanged; the previous implementation just didn't specify the behavior.
Probably, the feature is impossible for httprb after all.
Surely we can just rely on BGP to route traffic to proper addresses... right?
Do you mean using BGP routing in place of multi-address A records?
That's not an option for the vast majority of people; I believe the smallest block you can announce on the global routing table is a /24. And for the failover use case, you probably want to announce the block from the same application server that needs the failover; if that server goes down, the route is gone too, and otherwise the route persists.
That's just my current understanding, though; not sure if it is accurate/up to date.
If everyone failed over at the BGP level, IPv4 would probably have been exhausted several years, if not a decade, ago. Only a tiny percentage of people have that luxury.
The timeout behavior is deadline-oriented, so every step of the way you need a stateful timeout counter from which you deduct the elapsed time of each operation, and each subsequent operation gets only the time that remains.
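A minimal sketch of that idea, assuming a hypothetical Deadline helper (the names and hostnames are illustrative, not httprb API):

require "resolv"
require "socket"
require "timeout"

class Deadline
  def initialize(total)
    @time_left = total.to_f
  end

  # Run one operation under whatever budget remains, then deduct its cost.
  def measure
    raise Timeout::Error, "deadline exceeded" if @time_left <= 0

    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result  = Timeout.timeout(@time_left) { yield }
    @time_left -= Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    result
  end
end

deadline  = Deadline.new(20)
addresses = deadline.measure { Resolv.getaddresses("example.com") }    # resolution counts against the budget
socket    = deadline.measure { TCPSocket.open(addresses.first, 443) }  # so does each connect attempt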
Hello,
We recently upgraded to 5.0.2 and almost immediately started to notice major degradation on one of our Sidekiq queues that makes heavy use of HTTP. I have attached a screenshot that shows our jobs.wait metric (how long a job takes to complete from the time it was enqueued, in ms) for the last four weeks. We started seeing a lot of "dead" threads, that is, jobs making external HTTP requests that would hang indefinitely and exhaust our worker pool until we restarted dynos, which would immediately resolve the problem. The vertical line shows when the release with the HTTP upgrade was rolled out. I downgraded back to 5.0.1 today and we have thus far observed no dead threads.
I strongly suspect a regression in this PR, but haven't yet been able to build a reproducible test case due to the intermittent nature of external resources and the concurrent access to them. I'm happy to provide additional information and welcome any tips on how I can troubleshoot. We do use the global .timeout(20) in these requests.
Thanks!
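For reference, the timeout usage mentioned above looks roughly like this (assuming the http gem v5 API; the URL is a placeholder):

require "http"

# One global 20-second deadline covering the whole request:
HTTP.timeout(20).get("https://api.example.com/things")

# ...as opposed to per-operation limits:
HTTP.timeout(connect: 5, write: 5, read: 10).get("https://api.example.com/things")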