To add more context: The crash is happening on our Sidekiq servers. We use Sidekiq to
Thanks for the report. We were able to reproduce the issue and are looking into the best way to fix this. We'll post an update here as soon as possible.
Sorry about the trouble here @fabiolnm @rebelvc. Could you comment on what your concurrency set up looks like? Are you using multiple processes and/or multiple threads and/or multiple fibers?
We have some likely ideas about where the problem may be, but are still trying to narrow it down.
We usually have one Sidekiq instance with SIDEKIQ_CONCURRENCY set to 20 (threads).
Since yesterday, in order to mitigate the problem, we have 2 Sidekiq instances (different servers) with SIDEKIQ_CONCURRENCY set to 10, each.
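(SIDEKIQ_CONCURRENCY is presumably wired into their Sidekiq configuration; one common arrangement, shown here purely as an assumed example and not their actual setup, is to read it in config/sidekiq.yml, which Sidekiq evaluates through ERB.)

```yaml
# config/sidekiq.yml -- hypothetical example; Sidekiq runs this file through ERB,
# so the worker thread count can follow the SIDEKIQ_CONCURRENCY environment variable.
:concurrency: <%= ENV.fetch("SIDEKIQ_CONCURRENCY", 10).to_i %>
:queues:
  - default
```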
Thanks! Just wanted to let you know that I'm working on a fix for this.
I think my original assumption in the design was correct, in that a connection left in CLOSE_WAIT is reusable: if I start 10 threads and have them make requests with more than 60 seconds between them (so that the connection enters CLOSE_WAIT), each subsequent request seems to correctly reuse the connection that was originally created for its thread. The number of connections is stable no matter how long I run the program.
However, one pattern that I now realize I'm not handling is threads being used ephemerally, i.e. a new thread being spun up for every task, or threads occasionally being rotated through as old ones die. Connections on no-longer-used threads will not be reclaimed.
I think the best thing to do would be to ensure that there's a cleanup mechanism active in the code that GCs connection managers that haven't seen use in some time. Most of the infrastructure for this change is already in place, so it should be relatively quick to write.
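For illustration only, here is a minimal sketch of that kind of cleanup, assuming a registry of per-thread connection managers plus a periodic sweep that drops ones that have gone unused; the class and method names are hypothetical, not stripe-ruby's actual internals (the real change is the patch referenced below):

```ruby
require "monitor"

# Hypothetical sketch of GC-ing idle per-thread connection managers.
# Not stripe-ruby's real implementation; names are illustrative.
class ConnectionManagerRegistry
  TTL = 120 # seconds of inactivity before a manager is considered stale

  def initialize
    @lock = Monitor.new
    @managers = {} # thread => { manager:, last_used: }
  end

  # Returns (and timestamps) the manager for the current thread,
  # building one on first use.
  def for_current_thread(&build)
    @lock.synchronize do
      entry = (@managers[Thread.current] ||= { manager: build.call, last_used: nil })
      entry[:last_used] = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      entry[:manager]
    end
  end

  # Drops managers owned by dead threads or unused for longer than TTL,
  # closing their connections so the sockets are actually released.
  def sweep!
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @lock.synchronize do
      @managers.delete_if do |thread, entry|
        stale = !thread.alive? || (now - entry[:last_used]) > TTL
        entry[:manager].clear if stale # `clear` assumed to close any open sockets
        stale
      end
    end
  end
end
```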
Alright, I believe I have a working patch in #851.
@fabiolnm @rebelvc I just released what I think is a fix in 5.2.0.
I wrote a program that spins up and then churns out many threads over an extended period of time, and was able to verify that the number of connection managers (and by extension, connections) being managed by StripeClient stayed stable.
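A rough sketch of that kind of verification harness follows; the specific request being made and the way connection managers are counted are assumptions, not the actual test program:

```ruby
require "stripe"

Stripe.api_key = ENV.fetch("STRIPE_API_KEY")

# Churn through short-lived threads, each making one request, and watch how
# many Stripe::ConnectionManager objects stay alive. The class name is what
# recent stripe-ruby versions use internally; adjust if yours differs.
100.times do |round|
  threads = Array.new(10) do
    Thread.new { Stripe::Balance.retrieve }
  end
  threads.each(&:join)

  GC.start
  managers = ObjectSpace.each_object(Stripe::ConnectionManager).count
  puts "round #{round}: #{managers} connection manager(s) alive"
  sleep 1
end
```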
Based on a review of the code, I think this is likely to address the problem you're seeing, but I'm not absolutely sure, so we could use your help verifying it. Would you mind updating to the new version and letting us know what you're seeing?
@brandur-stripe thank you so much for this fix. We have deployed it to production. Before the fix, we had increased the number of Sidekiq servers so that each one had less chance of being overloaded.
Now I ran sudo lsof | grep CLOSE_WAIT | grep stripe | wc -l and the result is 203, instead of in the thousands previously.
@rebelvc Excellent, thanks for checking! That number you're seeing should also stay relatively stable over time (maybe not exactly stable, but pretty level), so let me know if you're seeing something else.
One other thing I'll note for clarity: just seeing CLOSE_WAIT isn't necessarily a bad thing. It means we have a closed connection because it's been idle for too long, but it can still potentially be reused if the owning thread wakes up and issues a new request. Even if each thread is making requests to api.stripe.com fairly actively, it may still be holding a connection to files.stripe.com or connect.stripe.com that's used much less frequently. So some of those 203 CLOSE_WAIT sockets you're seeing may still belong to threads that can do work over those connections if the need arises.
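To make the per-thread, per-host caching concrete, here is a simplified sketch with illustrative names (not the library's actual internals): each thread's manager holds one connection per host, so a rarely used files.stripe.com or connect.stripe.com socket can sit in CLOSE_WAIT while the api.stripe.com one stays busy.

```ruby
require "net/http"
require "uri"

# Illustrative sketch of one thread's connection manager caching a
# connection per host. Not stripe-ruby's real code.
class PerHostConnections
  def initialize
    @connections = {} # "api.stripe.com" => started Net::HTTP instance
  end

  def execute_request(uri)
    conn = (@connections[uri.host] ||= begin
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = true
      http.start
    end)
    # If the cached socket has gone stale (e.g. the server closed its side and
    # it sits in CLOSE_WAIT locally), Net::HTTP will reconnect and retry
    # idempotent requests like GET on this same object.
    conn.request(Net::HTTP::Get.new(uri))
  end
end

manager = PerHostConnections.new
manager.execute_request(URI("https://api.stripe.com/"))
```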
Going to close this out for now, but thanks for reporting and the quick feedback.
Software Versions
Original problems
Investigation
sudo lsof is showing hundreds of lines containing:
Reproducing CLOSE_WAIT in localhost
Wait ~1 minute and run
The process PID 6216 is the rails console; lsof keeps listing it until the rails console is closed.
Related Google searches