Resource leaks, timeouts when using the default "thread" reporting mechanism: new thread created for each message

bsmedberg-xometry commented 2 years ago

We tried adopting pyrollbar into our production applications yesterday, and had to disable it for causing timeouts and other resource issues.

After investigation, I believe that the problem is in the way the default thread reporting mechanism works:

upon report, it launches a completely new thread for each item that is submitted: https://github.com/rollbar/pyrollbar/blob/4eb8a80a04abb66784d187fe1b6f04680a591d3a/rollbar/__init__.py#L1508
it stores references to these threads in a global Queue object _threads
when the thread runs, it pulls something off the _threads Queue... but there's no guarantee that it pull itself off the queue; it could be pulling any thread off the queue!

The purpose of the global _threads queue is to allow the code to .join() the queue at shutdown (although this mechanism is only used for lambdas).

I'm pretty sure that this mechanism is leaking either actual OS threads or (more likely) python objects and resources like mutexes; I'm not 100% sure how, but the fact that threads don't pull themself out of the queue is likely the problem.

In any case, this threading mechanism is very inefficient: individual threads a heavyweight mechanism for queuing small tasks. It would be much better to have a single background thread or a small background thread pool that processes a queue of items to post; this work thread(s) poll/block on the queue and post items.

danielmorell commented 2 years ago

Let me investigate this a bit more. But initially, I think using a tread pool might be the best option.

ghost commented 1 year ago

The fix is on its way, but it takes some time as touching the thread pool involves some risk. Alternatively, you can try using our httpx or async handler. Using one of those would entirely bypass the new thread issue. The handler can be set by passing the handler argument to the init() function: rollbar.init(handler='async')

danielmorell commented 1 year ago

Hey @bsmedberg-xometry, the beta release with the fix for this is out https://pypi.org/project/rollbar/0.16.4b0/. Please test it out and let us know how it performs.

bsmedberg-xometry commented 1 year ago

Thank you! We can probably get this into our stage environment very quickly, but probably can't test under load until next week after month close.

danielmorell commented 1 year ago

That makes a lot of sense. If you have not read the write up in the PR, I would highly recommend it. It explains the impact of your Python version on the default number of worker threads in the pool.

If you have any issues, thoughts, or discoveries please let me know. I am happy to help!

ghost commented 1 year ago

Hey @bsmedberg-xometry, We just had an internal discussion where I was informed that the threading error didn't get resolved with v0.16.4b0. Can you share more details on what went wrong exactly? We're happy to investigate this issue further.

ghost commented 1 year ago

We released a new batch payload transformer with v0.16.4b1, that should resolve this issue. @bsmedberg-xometry Cany you test and validate if it works for you? Let me know if you need any help.

danielmorell commented 2 months ago

This was resolved a while ago.

rollbar / pyrollbar

Resource leaks, timeouts when using the default "thread" reporting mechanism: new thread created for each message #412