Can you try reverting that commit on a local fork and see if it resolves the issue?
Yes, I will run the below in production and report back; I may wait until next week, depending on whether it resolves the issue, just to be sure.
Great, thanks! If it fixes the problem, please open a PR!
Thanks @tarcieri! It would be easy enough for me to open a PR with my linked branch, but I think it's appropriate to tag the original author @midnight-wonderer before we wipe out his valued contribution in the next release without warning.
@midnight-wonderer: I believe the retry loop implemented in this feature is responsible for consuming the thread pool in my BG workers. I'm unable to produce a failing test script, and I don't fully understand the conditions or low-level details of why this retry loop hangs indefinitely, but the results in my application have made it pretty clear that this commit is the culprit.
I'm happy to debug or provide details as requested, I just need some hints or a nudge in the right direction as I don't have the time or resources to wrestle such a complex, intermittent bug.
Looking into it, getting back to you in a day (or two).
I know you understand the code already, but I'd argue that the change in the PR is an abstract one.
The change describes high-level policy for how the software should behave without specifying implementation details. The added code does not specify how to resolve the address (that definition moved to the constructor), how to interact with the OS, how to handle the network stack, or anything thread-related.
Apart from the policy level, there is no change from the existing code. While the issue surfaced at this commit, the real culprit might be elsewhere.
This
def connect(socket_class, host, port, nodelay = false)
  reset_timer
  ::Timeout.timeout(@time_left, TimeoutError) do
    @socket = socket_class.open(host, port)
    @socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1) if nodelay
  end
  log_time
end
is the same as this
def connect(socket_class, host, port, nodelay = false)
  reset_timer
  connect_operation = lambda do |host_address|
    ::Timeout.timeout(@time_left, TimeoutError) do
      super(socket_class, host_address, *args)
    end
  end
  connect_operation.call(host)
  log_time
end
no behavior change.
And this
host_addresses = @dns_resolver.call(host_name)
# ensure something to iterate over
trying_targets = host_addresses.empty? ? [host_name] : host_addresses
reset_timer
trying_iterator = trying_targets.lazy
error = nil
begin
  connect_operation.call(trying_iterator.next).tap do
    log_time
  end
rescue TimeoutError => e
  error = e
  retry
rescue ::StopIteration
  raise error
end
is just the translation of this policy
all_possible_addresses.each do |addr|
  connect_operation.call(addr)
end
implemented purely with language primitives without libraries or native extensions.
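To see the mechanics in isolation, here is a standalone sketch of the lazy-iterator/retry/StopIteration pattern, using made-up addresses and a stand-in "connect" that always times out (this is not the httprb code itself):

require "timeout"

addresses = %w[192.0.2.1 192.0.2.2 192.0.2.3]
iterator  = addresses.lazy
error     = nil

begin
  target = iterator.next                                   # raises StopIteration once exhausted
  raise Timeout::Error, "connect to #{target} timed out"   # stand-in for a timed-out connect
rescue Timeout::Error => e
  error = e
  retry                                                    # re-enter the block; `next` yields the following address
rescue StopIteration
  raise error                                              # every address was tried; re-raise the last failure
end
# => eventually raises "connect to 192.0.2.3 timed out"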
IMHO, if we want to move forward, we should keep the policies and fix things at the lower level. Depending on the objective, if we want to make this regression go away right now and drop the progress, we can simply revert the change. That option might break other people who rely on the behavior, though.
From this piece of code, it is either Timeout.timeout or TCPSocket.open that created the "dead threads".
My guess is that the issue stems from TCPSocket not supporting timeouts.
Should we use Socket#connect_nonblock instead? Or rather, are we able to?
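For what it's worth, here is a minimal sketch of what a connect_nonblock-based connect with a real timeout might look like (not httprb code; the host, port, and raised error are illustrative):

require "socket"
require "io/wait"

def connect_with_timeout(host, port, timeout)
  # NOTE: getaddrinfo itself can still block; only the connect is bounded here.
  address  = Socket.getaddrinfo(host, nil, :INET).first[3]
  sockaddr = Socket.sockaddr_in(port, address)
  socket   = Socket.new(:INET, :STREAM)

  begin
    socket.connect_nonblock(sockaddr)
  rescue IO::WaitWritable
    # Connect is in progress; wait until the socket is writable or we time out.
    unless socket.wait_writable(timeout)
      socket.close
      raise Errno::ETIMEDOUT
    end
    begin
      socket.connect_nonblock(sockaddr) # check the outcome of the in-progress connect
    rescue Errno::EISCONN
      # already connected, nothing to do
    end
  end

  socket
end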
However, if the reversal works for you guys, just giving me a heads up is enough. If, in the end, resolving the regression as fast as possible is the top priority, I have no objection to it.
I got some clues, but can't really confirm. I recall from somewhere that Enumerator can interfere with threads somehow; it was probably mentioned in one of @ioquatix's talks.
Can you get rid of lazy and try again?
Probably with something like:
trying_targets = host_addresses.empty? ? [host_name] : host_addresses
reset_timer
error = nil
trying_targets.each do |target|
  connect_operation.call(target).tap do
    log_time
  end
  error = nil
  break
rescue TimeoutError => e
  error = e
  next
end
raise error if error
@midnight-wonderer Thanks for the response and explanation! I agree with your assessment that your change is higher level and that the implementation details are at the core of this. I'm not comfortable continuing to use my organization's production environment as a testing ground to troubleshoot, at least until I have a firm grasp on what is happening at the lowest level and can reproduce the behavior locally.
My best guess at this point is that your change has compounded this underlying issue by making N + (Number of A records returned from Resolv.getaddresses(host_name)) connection attempts, as opposed to timing out once. My feeling is that timing out calls to potentially unresponsive services with TCPSocket.open is exhausting available client (outbound) ports, or causing ports that previously timed out, and were subsequently not closed properly, to be selected and then hang.
This is conjecture, and until I can repro locally, I feel I'm flying blind and spinning my wheels. You mentioned TCPSocket not being able to support timeouts, though I would point out that Ruby 3 does support this, and net/http now uses this option to handle connect timeouts, though of course http must support Ruby < 3, so that probably isn't helpful. FWIW, I think Timeout is the primary culprit here.
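For reference, a rough sketch of the Ruby 3 approach mentioned above (the host and port are placeholders; to my knowledge Socket.tcp has accepted connect_timeout: for a while, and TCPSocket.new gained it in Ruby 3.0):

require "socket"

# Raises Errno::ETIMEDOUT after 5 seconds without resorting to Timeout.timeout:
socket = Socket.tcp("api.example.com", 443, connect_timeout: 5)

# On Ruby 3.0+, TCPSocket.new accepts the same keyword:
socket = TCPSocket.new("api.example.com", 443, connect_timeout: 5)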
If you or @tarcieri can provide any ideas to get a reproducible script, I am eager to learn and help figure out a resolution for this issue. I am in no rush to merge or resolve if this problem is isolated to my usage; we've patched to remove this commit and are no longer experiencing issues.
Oh, OK, my bad; I thought you could reproduce it already since you can pinpoint and confirm that the issue will go away if we revert the commit.
I think only system programmers who know Ruby can explain exactly what happened. I believe Timeout.timeout's implementation is too arbitrary and does not look trustworthy, but I still don't understand why there was no problem before the commit.
Since system programming is not my area, I cannot assist any further than this.
But if I have some spare time, I'll try to see how I can help get rid of Timeout.timeout, at least for Ruby 3.
To clarify, I did revert your commit and kept everything else in place on 5.0.2, and we have not noticed the issue resurface in 48 hours now.
I see that https://github.com/httprb/http/pull/638 added some begin/rescue/retry blocks that have no retry count limits. So I guess the problem is that for some reason it starts raising and retrying forever in some situations.
So I guess the problem is that for some reason it starts raising and retrying forever in some situations.
It will only retry N + (Number of A records returned from Resolv.getaddresses(host_name)) times. After the final iteration, ::StopIteration raises and then it re-raises the last error that occurred.
As others have noted, Timeout is quite problematic and not particularly safe. Unfortunately it is the only way to get timeouts while leveraging the full functionality of the system resolver (i.e. libc's), because it's calling into a native library.
The main alternative is to use Resolv::DNS, which is written in Ruby and supports its own timeout mechanism via Resolv::DNS#timeouts; however, it doesn't support the full functionality of the system resolver.
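A minimal sketch of that alternative, assuming the stdlib Resolv::DNS API (the hostname is a placeholder):

require "resolv"

addresses = Resolv::DNS.open do |dns|
  dns.timeouts = 1                 # seconds per attempt; an array sets per-retry timeouts
  dns.getaddresses("example.com")  # pure-Ruby lookup, no Timeout.timeout needed
end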
@tarcieri My understanding is that timeouts are not occurring on DNS resolution, but on actually opening a TCPSocket to individual addresses? What am I missing?
Oh sorry, perhaps I was missing something?
Here's an old example I wrote of how to do non-blocking connect, but as noted above, it involves using Socket instead of TCPSocket:
https://github.com/socketry/socketry/blob/master/lib/socketry/tcp/socket.rb
Didn't #638 change the meaning of the timeout parameter during connection?
Before #638, DNS resolution took place within the scope of connection timeout block (via getaddrinfo during TCPSocket.new). Now it's taking place outside the scope of the connect timeout. After #638, if the hostname resolved to multiple addresses, and the timeout is reached, it will get restarted, and another attempt will be made to the next address.
So, for example, if we are trying to connect to a hostname which resolves to 8 addresses and our connect timeout is 30s, then the maximum runtime (if none of the hosts are accepting requests) has increased from 30s to 240s.
Should that have been a breaking change? Have I misunderstood anything here?
@jordanstephens It looks to me like you are right and this is how PerOperation works now. This doesn't explain why threads could be hanging forever, though.
Seems like we should just revert this for now.
@PhilCoggins mind opening a PR?
~I believe the combination of log_time and @time_left prevents the "resetting" of the timeout that @jordanstephens describes.~ Regardless, I can't get to the bottom of this, so I think it would be a wise decision to revert: https://github.com/httprb/http/pull/695.
Surely we can just rely on BGP to route traffic to proper addresses... right?
EDIT: It is correct, however, that DNS resolution takes place outside the scope of the timeout block... perhaps that is to blame here.
EDIT 2: Axed the above; if Timeout is raised, log_time would never be called, and thus the connect timeout is actually fresh for every address.
Didn't #638 change the meaning of the timeout parameter during connection?
Before #638, DNS resolution took place within the scope of connection timeout block (via getaddrinfo during TCPSocket.new). Now it's taking place outside the scope of the connect timeout. After #638, if the hostname resolved to multiple addresses, and the timeout is reached, it will get restarted, and another attempt will be made to the next address.
So, for example, if we are trying to connect to a hostname which resolves to 8 addresses and our connect timeout is 30s, then the maximum runtime (if none of the hosts are accepting requests) has increased from 30s to 240s.
Should that have been a breaking change? Have I misunderstood anything here?
Yep, you misunderstood something. The timeout in question is the "connection timeout"; it means a timeout per connection attempt.
If you have 3 IP addresses, the total timeout can be multiplied by 3. This is consistent with Nginx when used as a reverse proxy. I also discussed this very same topic with Daniel (the curl author) last year; he added options to separate the two (total timeout vs. per connection attempt).
If you set the connection timeout to 3 seconds and you have 3 IP addresses to connect to, it doesn't make sense to divide the connection timeout by 3 and allocate 1 second to each; the number of IP addresses is controlled by other entities, after all.
@midnight-wonderer but do you agree that after #638 the DNS lookup is done outside of timeout block? So if it hangs for some reason, it's going to hang forever?
I think @jordanstephens is right on the money; he mentioned the connection timeout multiple times. Also, I don't think anyone is saying it's right or wrong behavior, just that it's a breaking change that went out in a patch release and was not made obvious.
@midnight-wonderer but do you agree that after #638 the DNS lookup is done outside of timeout block? So if it hangs for some reason, it's going to hang forever?
I had actually overlooked this possibility; thanks for pointing it out. That could be a possible explanation for our hung threads.
Note: #695 shipped in v5.0.3.
@midnight-wonderer but do you agree that after #638 the DNS lookup is done outside of timeout block? So if it hangs for some reason, it's going to hang forever?
Sure, that is the breaking change.
My apologies for not realizing back then that, in this respect, it is a breaking change. However, back then, apart from the DNS resolution timeout, the commit just implemented previously unspecified behavior; hence I did not mention a breaking change.
There is probably no way to do it in a way everyone can agree upon. Some folks assume the timeout behavior based on the fact that the client won't attempt more than one connection; some people expect the connection timeout, based on the name, to cover only the period between TCP SYN and ACK.
I don't think anyone is saying it's right or wrong behavior.
Me neither; previously, I just addressed the part @jordanstephens brought up related to multiple timeouts. In that respect, the meaning of the term is unchanged; the previous implementation just didn't specify the behavior.
Probably, the feature is impossible for httprb after all.
Surely we can just rely on BGP to route traffic to proper addresses... right?
Do you mean using BGP routing in place of multi-address A records?
That's not an option for the vast majority of people; I believe the smallest block you can announce on the global routing table is a /24. And for the failover use case, you probably want to announce the block from the same application server that needs the failover; if that server goes down, the route is gone too, and otherwise the route persists.
That's just my current understanding, though; not sure if it is accurate/up to date.
If everyone failed over at the BGP level, IPv4 would probably have been exhausted several years, if not a decade, ago. Only a tiny percentage of people have that luxury.
The timeout behavior is deadline-oriented, so every step of the way you need a stateful timeout counter from which you deduct the elapsed time of each operation, and each subsequent operation gets only the time that remains.
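A minimal sketch of that idea, assuming a hypothetical Deadline helper (the names and hostnames are illustrative, not httprb API):

require "resolv"
require "socket"
require "timeout"

class Deadline
  def initialize(total)
    @time_left = total.to_f
  end

  # Run one operation under whatever budget remains, then deduct its cost.
  def measure
    raise Timeout::Error, "deadline exceeded" if @time_left <= 0

    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result  = Timeout.timeout(@time_left) { yield }
    @time_left -= Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    result
  end
end

deadline  = Deadline.new(20)
addresses = deadline.measure { Resolv.getaddresses("example.com") }    # resolution counts against the budget
socket    = deadline.measure { TCPSocket.open(addresses.first, 443) }  # so does each connect attempt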
Hello,
We recently upgraded to 5.0.2 and almost immediately started to notice major degradation on one of our Sidekiq queues that makes heavy use of HTTP. I have attached a screenshot that shows our jobs.wait metric (how long a job takes to complete from the time it was enqueued, in ms) for the last four weeks. We started seeing a lot of "dead" threads, that is, jobs making external HTTP requests that would hang indefinitely and exhaust our worker pool until we restarted dynos, which would immediately resolve the problem. The vertical line shows when the release with the HTTP upgrade was rolled out. I downgraded back to 5.0.1 today and we have thus far observed no dead threads.
I strongly suspect a regression in this PR, but haven't yet been able to build a reproducible test case due to the intermittent nature of external resources and the concurrent access to them. I'm happy to provide additional information and welcome any tips on how I can troubleshoot. We do use the global .timeout(20) in these requests.
Thanks!
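For reference, the timeout usage mentioned above looks roughly like this (assuming the http gem v5 API; the URL is a placeholder):

require "http"

# One global 20-second deadline covering the whole request:
HTTP.timeout(20).get("https://api.example.com/things")

# ...as opposed to per-operation limits:
HTTP.timeout(connect: 5, write: 5, read: 10).get("https://api.example.com/things")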