Closed raykrueger closed 11 years ago
> The problem is that, under load, garbage collection does not get to run

The above is your bug.

> all of the UDPSockets attached to threads are left behind

You have a bigger problem, because this means the threads are being left behind. The threads are more expensive than the sockets.
> This may be applicable to other threading/forking situations like unicorn and passenger as well.

It should not be; unicorn is single threaded, and I don't know about passenger, but I'd hope it's using thread pools.
I haven't looked, but maybe this has something to do with Celluloid's use of fibers? (cc @tarcieri)
If it is celluloid related, I would recommend that you create an actor that handles stats submissions, and ship the stats through it. This way you'll only ever have a single statsd socket. You'd also have the advantage of never blocking on stats submissions, with the only danger being that you'll fill the actor's mailbox instead (which could look like a leak). Example follows:
```ruby
class Stactor
  include Celluloid

  def initialize(statsd)
    @statsd = statsd
  end

  def method_missing(name, *args, &block)
    @statsd.send(name, *args, &block)
  end

  # TODO: respond_to_missing?, etc.
end
```
Now you can send them all asynchronously:
```ruby
Stactor.new(statsd).async.increment "ohai"
```
One thing to watch out for, however: if you call async and then pass a block, you're not in the original stack context inside the block, so you need to handle the concurrency problems yourself. Statsd itself won't protect you against that, although it should work internally. Batching should not be used across threads; typically, if you're doing batching, keep that batch object around and use it from a single thread (like, in an actor ;-) )
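For instance, single-threaded batching could be confined to an actor like this (a hypothetical class; it assumes the gem's `Statsd#batch`, which yields a batch object and flushes when the block exits):

```ruby
# Hypothetical sketch: the actor owns the Statsd instance, so the batch
# object never crosses a thread boundary.
class BatchingStactor
  include Celluloid

  def initialize(statsd)
    @statsd = statsd
  end

  # Flush a set of counter names as batched UDP payloads,
  # entirely on this actor's single thread.
  def flush(names)
    @statsd.batch do |b|
      names.each { |n| b.increment(n) }
    end
  end
end
```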
Is it possible this is related to thread locals? All released versions of Celluloid presently do not clean thread locals when threads are returned to Celluloid's thread pool. This is fixed on Celluloid master.
Well if it were not cleaning up thread locals I think that would be fine, as statsd would just reuse the existing socket and likely be fine. I know absolutely nothing about Celluloid right now though. Does it use a thread pool or does it just create and destroy threads as needed?
@raggi makes the point that really, it seems like I'm leaking threads that still have UDPSocket sockets connected.
I also totally forgot to mention that this is on ruby 1.9.2p290 on ubuntu 11.10.
Celluloid uses a thread pool, so if it fails to clean up the state of a thread properly before recycling it, it can leak memory via idle threads in the pool.
Well, to be specific, I was battling leaking file descriptors today. Once sidekiq got going (with 150 concurrency) you could just watch the open file descriptors pile up until the errors started.
I'm only beginning to read about Celluloid right now, but given its self-healing ways, does it replace threads in the pool with new instances if there are exceptions?
Celluloid recycles the thread regardless of whether or not an exception occurred. It has a toplevel exception handler running inside each thread in the pool.
At concurrency 150, you'll end up using a lot of file descriptors. Depending on what else you're doing, I could easily see something like 750 (statsd, database, redis, https). That should still be inside the limits for a default linux setup, but ulimit on osx defaults to 256.
As far as thread recycling goes, even without being cleared out, this will not be a problem. You may end up with 150 udp bindings, but that's not the end of the world. Given that more recent versions of celluloid will clear out the locals, I would recommend binding to a single actor, as I discussed before, and giving a dedicated thread (if that's possible). This would also go some way toward preventing the mailbox overflow I described.
Also, on MRI, I'm not really sure you want to be running with such a high concurrency. It depends a lot what your specific jobs are, but I'd propose knocking that down to 15-50, and observe throughput. I'd bet throughput is higher in the lower concurrency region.
Tony and James, thanks for taking the time to help me sort this out, and for all the work you do. Just wanted to say that first :)
Our descriptor limit is 1024, but I could watch our sidekiq process just keep gaining and gaining until it hit the "Statsd: Errno::EMFILE Too many open files - socket(2) - udp" errors. Without Statsd in place we stayed in line with where we'd normally be with connections to mongo and redis.
I really like the statsd actor idea, I'm going to take a crack at that here shortly.
I'm trying to isolate what I was seeing in production a reproducible sample here. We'll see.
Thanks again!
@tarcieri You mentioned Celluloid cleaning up thread locals. Is that in 0.12.4? I ask because, if Celluloid is clearing out thread locals, then the UDPSocket that Statsd set up will not be closed when it is removed from local storage, which could cause the descriptor leaking I was seeing.
It's not in 0.12.4, only in master (0.13)
In the end you really need to find a way to ensure the UDP sockets are getting closed. Celluloid lets you define a finalizer which could be used to close sockets when actors are terminated, for example.
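A minimal sketch of that pattern (hypothetical actor; it assumes Celluloid's `finalizer` class method, which names a method to run when the actor terminates):

```ruby
require 'socket'

class SocketOwningActor
  include Celluloid

  # Hypothetical: run close_socket whenever this actor terminates,
  # whether it shuts down cleanly or crashes.
  finalizer :close_socket

  def initialize(host, port)
    @socket = UDPSocket.new
    @socket.connect(host, port)
  end

  def write(payload)
    @socket.send(payload, 0)
  end

  def close_socket
    @socket.close unless @socket.closed?
  end
end
```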
For an actor-based system like Celluloid, I think using an actor (or pool of actors) to manage Statsd as @raggi mentioned makes the most sense. The Statsd actor can consume stats messages off of its mailbox and then use batch delivery for efficiency and none of the other actors care about the concurrency or threadsafety or resource allocation/deallocation concerns. More encapsulation, safer concurrency, (probably) more performance.
Actually, looking at Sidekiq, I see that it handles Processor failure by linking a new one in...
https://github.com/mperham/sidekiq/blob/v2.6.4/lib/sidekiq/manager.rb#L75-L87
```ruby
trap_exit :processor_died

# ...

def processor_died(processor, reason)
  watchdog("Manager#processor_died died") do
    @in_progress.delete(processor.object_id)
    @busy.delete(processor)

    unless stopped?
      @ready << Processor.new_link(current_actor)
      dispatch
    else
      signal(:shutdown) if @busy.empty?
    end
  end
end
```
Sorry, still trying to track down why the threads and sockets aren't being reused as they should.
This seems as good a place to have the conversation as any. It's not like you're hurting anyone. :smile:
Yep, I only closed it for my own tracking purposes, I'll keep an eye on the email thread :-)
Thanks guys :)
@tarcieri Do you think the Sidekiq code above, given Celluloid 0.12.4, could cause the Thread local UDPSocket leaking I was seeing?
Thanks again, -Ray
@raykrueger yeah, thanks to Fibers even if you use @raggi's suggested approach it will make a new socket for each fiber :(
Aah :( ok thanks! So it seems my best bet is the Sidekiq middleware I mentioned then. I did actually get a Statsd actor working, but it was causing deadlocks when Unicorn initialized. That's a tale for another day though heh.
If you use @raggi's approach, and call the "exclusive" class method, you can at least eliminate the Fibers from ganking your thread locals.
Hey guys, I was doing other work for a while but now it's time to drag this out of the dust again :)
@tarcieri Given that sidekiq handles crashed actors by starting new ones is this a bug in sidekiq really? This is the part where I drag @mperham in :)
So it turns out, we've been leaking MongoDB connections for a long time as well. We're using mongodb-ruby 1.7.1, which uses a connection pool. When we're seeing workers fail, eventually we see mongo-ruby complaining that there are no available connections in the pool. Mongo-ruby also uses thread local storage to manage its connection pool.
Again, we're on ruby 1.9.2 running sidekiq 2.6.4 and celluloid 0.12.4 on ubuntu 11.10 in AWS.
P.S. If you're ever in Chicago I will buy you beer :)
@raykrueger Definitely not a Sidekiq bug. That's the intended use pattern for Celluloid.
If anything this is a Celluloid bug. I have some ideas about how to fix it but have not had time to implement them.
@tarcieri Tony, I was thinking about this again today. One thing that keeps coming to mind is the thread pool. If each thread were being correctly returned to the thread pool, we wouldn't end up leaking the statsd UDPSockets. When a worker crashes and a new one is started, it should, theoretically, be able to re-use any thread locals in that thread (such as the UDPSocket in question here, mongo connections, redis connections).
Is it possible that celluloid is killing the thread with the worker and starting a new thread for the new worker? I dug into this a while back but don't remember what I ended up finding heh.
There are a number of problems with the way Celluloid works now, most notably not providing a consistent view of thread locals across "Tasks" within the same actor.
Have a fix in mind for this, but it probably won't make it into Celluloid until 0.14 (release after the upcoming one)
We are having this same problem using jruby + trinidad. In Java, even when the threads are finished the UDP connections are left open, which is causing us the same file descriptor problems described above.
Add the middleware you see above and you'll be fine, and you'll have timings around all your jobs :P
Note that Celluloid 0.14 includes a fix for thread local behavior which should correct this problem under Celluloid at least. Prereleases are up on RubyGems, full release soon!
Thanks guys, just one more clarification. Why not use a connection pool as suggested earlier to manage the UDP connections?
At least on Tomcat/Trinidad there seem to be some issues with keeping resources in ThreadLocals.
http://tomcat.10.x6.nabble.com/Executor-thread-lifecycle-td1995175.html
Specifically:
> I'm using ThreadLocal to allocate some resources per thread...
>
> That is asking for a memory leak. There is a simple rule for correctly using a ThreadLocal in a web application: the ThreadLocal must be removed from the thread before the request/response in which it was created completes processing.
Any thoughts on this?
ThreadLocals for the UDPSocket are fine if there's a properly managed thread pool in use. The existing UDPSocket can be re-used safely. The UDPSockets aren't really leaked, the threads are, and since the sockets are attached as locals they go out with the bathwater.
I'm not clear on this but it seems Celluloid doesn't back the actors with a thread pool. I think a new thread is started for each actor. So when a sidekiq worker (actor) dies, a new one is started and the thread is leaked (along with the locals). Tony might say all this is completely wrong though :)
Celluloid does use a thread pool, however the thread locals are cleared when threads are returned to the pool.
Ah ha! I don't know how I didn't pick up on that in all the previous conversation :)
Is there any reason for not creating a UDPSocket.new each time it is required, instead of holding a reference to it on the Thread?
That is the effective result of my middleware actually.
OK, now it makes sense. Thanks!
@elmer-garduno Good question. I'd rather do away with thread locals (because they're so often not) if possible. @raggi, thoughts?
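A per-call socket (a hypothetical helper, not the gem's actual implementation) would look something like this; the cost is an extra socket(2)/close(2) per stat, but nothing lingers on the thread:

```ruby
require 'socket'

# Hypothetical helper: open a fresh UDP socket for every send instead of
# caching one in a thread local. Slightly more syscall overhead, but
# nothing stays attached to the thread afterwards.
def send_stat(message, host = '127.0.0.1', port = 8125)
  socket = UDPSocket.new
  socket.send(message, 0, host, port)
ensure
  socket.close if socket && !socket.closed?
end
```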
Running into these issues now. We're using Celluloid (0.15.2) and JRuby with the latest gem of statsd-ruby; so many sockets are created that it ends up hitting limits and crashing!
After some experimentation and testing, something like @raykrueger's change to use a single socket, along with an exclusive celluloid actor, seems to be working...
https://github.com/reinh/statsd/issues/42 and https://github.com/reinh/statsd/issues/31 have both been closed (with good reason)... Would a PR that allows for configurable options for each/either of those methods, with default behavior as-is, be a good thing?
Also, @tarcieri, could you help shed some light on the socket creation based on thread variables for me? We're on a current celluloid, after the https://github.com/celluloid/celluloid/commit/395a6dff992a16609baa94331b50c4848678967d commit, but whenever it gets here, it creates a new socket every time.
Here's our actor we send the stats through:
```ruby
class Stactor
  include Celluloid
  exclusive

  attr_accessor :statsd

  def initialize(statsd)
    @statsd = statsd
  end

  def send_stat(meth, *args, &blk)
    statsd.send(meth, *args, &blk)
  end

  def method_missing(meth, *args, &blk)
    self.async.send_stat(meth, *args, &blk)
  end
end
```
Ruby has wacky thread local semantics. The behavior of `Thread.current#[]` should probably have been `Fiber.current#[]` instead, but alas, to get "real thread locals" @tenderlove had to add `Thread.current.thread_variable_get` and `Thread.current.thread_variable_set`, which is probably what you want here, but unavailable before Ruby 2.0.
Also Ruby doesn't support destructors for thread locals, so it's easy for stuff like this to leak.
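The difference is easy to demonstrate on Ruby 2.0+; a new Fiber on the same thread sees the thread variable but not the (misleadingly named) "thread local":

```ruby
t = Thread.new do
  Thread.current[:a] = 1                     # fiber-local, despite the name
  Thread.current.thread_variable_set(:b, 2)  # true thread-local
  # A fresh Fiber on this same thread: :a is invisible, :b is not.
  Fiber.new do
    [Thread.current[:a], Thread.current.thread_variable_get(:b)]
  end.resume
end
t.join
p t.value  # => [nil, 2]
```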
Ran into some issues today with our Sidekiq/Statsd setup. We have a nice sidekiq middleware component that wraps each job in a "time" block; it works great.
The problem is that, under load, garbage collection does not get to run and all of the UDPSockets attached to threads are left behind.
To remedy this I added a tweak to the middleware to clean up the UDPSocket after the job finishes.
This is then added into Sidekiq as a server middleware...
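The snippet itself isn't reproduced above; a rough sketch of such a middleware (hypothetical class and key names, assuming the gem caches its socket in `Thread.current[:statsd_socket]`) might be:

```ruby
# Hypothetical Sidekiq server middleware: time each job, then close the
# thread-local UDP socket so it can't outlive the worker thread.
class StatsdCleanupMiddleware
  def initialize(statsd)
    @statsd = statsd
  end

  def call(worker, msg, queue)
    @statsd.time("jobs.#{queue}") { yield }
  ensure
    if (socket = Thread.current[:statsd_socket])
      socket.close unless socket.closed?
      Thread.current[:statsd_socket] = nil
    end
  end
end

# Registration, also hypothetical:
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add StatsdCleanupMiddleware, Statsd.new('localhost', 8125)
  end
end
```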
This may be applicable to other threading/forking situations like unicorn and passenger as well.
I could package this up as an optional class or something, not sure if that's going too far. If nothing else, a wiki page might be good.