To add more context: The crash is happening on our Sidekiq servers. We use Sidekiq to
Thanks for the report. We were able to reproduce the issue and are looking into the best way to fix this. We'll post an update here as soon as possible.
Sorry about the trouble here @fabiolnm @rebelvc. Could you comment on what your concurrency set up looks like? Are you using multiple processes and/or multiple threads and/or multiple fibers?
We have some likely ideas about where the problem may be, but are still trying to narrow it down.
We usually have one Sidekiq instance with SIDEKIQ_CONCURRENCY set to 20 (threads).
Since yesterday, in order to mitigate the problem, we have 2 Sidekiq instances (different servers) with SIDEKIQ_CONCURRENCY set to 10, each.
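(SIDEKIQ_CONCURRENCY is presumably wired into their Sidekiq configuration; one common arrangement, shown here purely as an assumed example and not their actual setup, is to read it in config/sidekiq.yml, which Sidekiq evaluates through ERB.)

```yaml
# config/sidekiq.yml -- hypothetical example; Sidekiq runs this file through ERB,
# so the worker thread count can follow the SIDEKIQ_CONCURRENCY environment variable.
:concurrency: <%= ENV.fetch("SIDEKIQ_CONCURRENCY", 10).to_i %>
:queues:
  - default
```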
Thanks! Just wanted to let you know that I'm working on a fix for this.
I think my original assumption in the design was correct, in that a connection left in CLOSE_WAIT is reusable: if I start 10 threads and have them make requests with more than 60 seconds between them (so that the connection enters CLOSE_WAIT), each subsequent request seems to correctly reuse the connection that was originally created for its thread. The number of connections is stable no matter how long I run the program.
However, one pattern that I now realize I'm not handling is threads being used ephemerally, i.e. a new thread being spun up for every task, or threads occasionally being rotated through as old ones die. Connections on no-longer-used threads will not be reclaimed.
I think the best thing to do would be to ensure that there's a cleanup mechanism active in the code that GCs connection managers that haven't seen use in some time. Most of the infrastructure for this change is already in place, so it should be relatively quick to write.
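For illustration only, here is a minimal sketch of that kind of cleanup, assuming a registry of per-thread connection managers plus a periodic sweep that drops ones that have gone unused; the class and method names are hypothetical, not stripe-ruby's actual internals (the real change is the patch referenced below):

```ruby
require "monitor"

# Hypothetical sketch of GC-ing idle per-thread connection managers.
# Not stripe-ruby's real implementation; names are illustrative.
class ConnectionManagerRegistry
  TTL = 120 # seconds of inactivity before a manager is considered stale

  def initialize
    @lock = Monitor.new
    @managers = {} # thread => { manager:, last_used: }
  end

  # Returns (and timestamps) the manager for the current thread,
  # building one on first use.
  def for_current_thread(&build)
    @lock.synchronize do
      entry = (@managers[Thread.current] ||= { manager: build.call, last_used: nil })
      entry[:last_used] = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      entry[:manager]
    end
  end

  # Drops managers owned by dead threads or unused for longer than TTL,
  # closing their connections so the sockets are actually released.
  def sweep!
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @lock.synchronize do
      @managers.delete_if do |thread, entry|
        stale = !thread.alive? || (now - entry[:last_used]) > TTL
        entry[:manager].clear if stale # `clear` assumed to close any open sockets
        stale
      end
    end
  end
end
```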
Alright, I believe I have a working patch in #851.
@fabiolnm @rebelvc I just released what I think is a fix in 5.2.0.
I wrote a program that spins up and then churns out many threads over an extended period of time, and was able to verify that the number of connection managers (and by extension, connections) being managed by StripeClient stayed stable.
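A rough sketch of that kind of verification harness follows; the specific request being made and the way connection managers are counted are assumptions, not the actual test program:

```ruby
require "stripe"

Stripe.api_key = ENV.fetch("STRIPE_API_KEY")

# Churn through short-lived threads, each making one request, and watch how
# many Stripe::ConnectionManager objects stay alive. The class name is what
# recent stripe-ruby versions use internally; adjust if yours differs.
100.times do |round|
  threads = Array.new(10) do
    Thread.new { Stripe::Balance.retrieve }
  end
  threads.each(&:join)

  GC.start
  managers = ObjectSpace.each_object(Stripe::ConnectionManager).count
  puts "round #{round}: #{managers} connection manager(s) alive"
  sleep 1
end
```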
Based on a review of the code, I think this is likely to address the problem you're seeing, but I'm not absolutely sure, so we could use your help verifying it. Would you mind updating to the new version and letting us know what you're seeing?
@brandur-stripe thank you so much for this fix. We have deployed it to production. Before the fix, we had increased the number of Sidekiq servers so that each one had less chance of being overloaded.
Now I ran sudo lsof | grep CLOSE_WAIT | grep stripe | wc -l and the result is 203, instead of in the thousands previously.
@rebelvc Excellent, thanks for checking! That number you're seeing should also stay relatively stable over time (maybe not exactly stable, but pretty level), so let me know if you're seeing something else.
One other thing I'll note for clarity: just seeing CLOSE_WAIT isn't necessarily a bad thing. It means we have a closed connection because it's been idle for too long, but it can still potentially be reused if the owning thread wakes up and issues a new request. Even if each thread is making requests to api.stripe.com fairly actively, it may still be holding a connection to files.stripe.com or connect.stripe.com that's used much less frequently. So some of those 203 CLOSE_WAIT sockets you're seeing may still belong to threads that can do work over those connections if the need arises.
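To make the per-thread, per-host caching concrete, here is a simplified sketch with illustrative names (not the library's actual internals): each thread's manager holds one connection per host, so a rarely used files.stripe.com or connect.stripe.com socket can sit in CLOSE_WAIT while the api.stripe.com one stays busy.

```ruby
require "net/http"
require "uri"

# Illustrative sketch of one thread's connection manager caching a
# connection per host. Not stripe-ruby's real code.
class PerHostConnections
  def initialize
    @connections = {} # "api.stripe.com" => started Net::HTTP instance
  end

  def execute_request(uri)
    conn = (@connections[uri.host] ||= begin
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = true
      http.start
    end)
    # If the cached socket has gone stale (e.g. the server closed its side and
    # it sits in CLOSE_WAIT locally), Net::HTTP will reconnect and retry
    # idempotent requests like GET on this same object.
    conn.request(Net::HTTP::Get.new(uri))
  end
end

manager = PerHostConnections.new
manager.execute_request(URI("https://api.stripe.com/"))
```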
Going to close this out for now, but thanks for reporting and the quick feedback.
Software Versions
Original problems
Investigation
sudo lsof is showing hundreds of lines containing:
Reproducing CLOSE_WAIT in localhost
Wait ~1 minute and run
The process PID 6216 is the rails console; lsof keeps listing it until the rails console is closed.
Related Google searches