rollbar / pyrollbar

Error tracking and logging from Python to Rollbar
https://docs.rollbar.com/docs/python/
MIT License

Processing deep stack traces can result in denial of service #447

Closed · noahw3 closed this issue 1 month ago

noahw3 commented 2 months ago

I've recently had to debug our production servers crashing, and finally came to the realization that it's due to rollbar attempting to process errors. It's possible for rollbar to pin the CPU at 100% for minutes at a time, with memory usage growing until the entire instance or container gets OOM'd. This occurs when the stack trace it's trying to process is deep: a 200-frame-deep stack trace takes a couple of minutes to process, and a 500-frame-deep stack trace crashes the container.
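For reference, a minimal repro of this shape (hypothetical code with a placeholder token, not our actual service) looks something like:

```python
import rollbar

# Placeholder token/environment; the point is the CPU/memory cost of processing,
# not a successful send.
rollbar.init('POST_SERVER_ITEM_ACCESS_TOKEN', environment='development')

def recurse(depth):
    # Every frame carries some local state that locals collection has to serialize.
    frame_state = {'depth': depth, 'data': list(range(100))}
    if depth == 0:
        raise RuntimeError('error at the bottom of a deep stack')
    recurse(depth - 1)

try:
    recurse(500)  # ~500 frames; we saw minutes of CPU around 200 and OOM around 500
except RuntimeError:
    rollbar.report_exc_info()  # locals are collected for every frame by default
```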

We're using rollbar with Flask, almost identically to how it's shown in the example.
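Roughly this setup (simplified sketch; the token and route are placeholders):

```python
import rollbar
import rollbar.contrib.flask
from flask import Flask, got_request_exception

app = Flask(__name__)

# Placeholder token; in production this comes from config.
rollbar.init('POST_SERVER_ITEM_ACCESS_TOKEN', environment='production')

# Send unhandled exceptions from Flask to Rollbar, as in the documented example.
got_request_exception.connect(rollbar.contrib.flask.report_exception, app)

@app.route('/')
def index():
    return 'ok'
```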

The culprit is the attempt to trace and add the local variable state. Adjusting the settings to disable local variable collection completely resolves the issue; however, a) it's enabled by default, and b) the entire value we get out of rollbar comes from including the local state, so if we're going to globally disable it we may as well remove rollbar entirely.
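For anyone else hitting this, the workaround we tested was turning off locals collection entirely (a sketch, assuming init() merges the nested locals settings into rollbar.SETTINGS and that locals.enabled is the relevant switch):

```python
import rollbar

# Sketch: disable local variable collection entirely.
# Fixes the CPU/memory problem, but loses the local state we actually want.
rollbar.init(
    'POST_SERVER_ITEM_ACCESS_TOKEN',   # placeholder token
    environment='production',
    locals={'enabled': False},
)
```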

After digging into it, there are two expensive steps: _add_locals_data, which consumes the entire stack trace, and _build_payload, which attempts to apply transforms to the body.

Even if the above steps do finish eventually, it all ends up being wasted work anyway, because the payload size is hundreds of megabytes if not gigabytes, which the API either times out on or rejects (see the related issue https://github.com/rollbar/pyrollbar/issues/237).

It's not entirely clear to me what the best approach is here. One possibility would be to look at the depth of the stack trace and not attempt to collect local data at all if it's over a certain threshold (rough sketch below). Another would be to collect some data but return early once it reaches a certain size. Ideally, the sanitization steps would be done in tandem with the collection so that the current payload size could be compared to the maximum payload size at any given time.
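As a very rough sketch of the first idea, implemented on the caller's side rather than inside pyrollbar (it assumes locals collection is gated on rollbar.SETTINGS['locals']['enabled'], and it isn't thread-safe, since that flag is global):

```python
import sys
import traceback

import rollbar

MAX_FRAMES_WITH_LOCALS = 100  # hypothetical cutoff

def report_current_exception():
    """Report the active exception, skipping locals collection for very deep traces."""
    exc_info = sys.exc_info()
    depth = len(traceback.extract_tb(exc_info[2]))

    locals_settings = rollbar.SETTINGS.get('locals', {})
    previously_enabled = locals_settings.get('enabled', True)
    if depth > MAX_FRAMES_WITH_LOCALS:
        # Too deep: collecting locals for every frame is what pins the CPU.
        locals_settings['enabled'] = False
    try:
        rollbar.report_exc_info(exc_info)
    finally:
        locals_settings['enabled'] = previously_enabled
```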

zdavis-rollbar commented 2 months ago

Thanks for pointing this out, @noahw3. We will look into this and get back to you shortly.

danielmorell commented 2 months ago

Hey Noah, thank you for the detailed and thoughtful explanation of the issue you are facing. I had a conversation with someone internally a few weeks ago where we came up with some similar ideas to resolve the issue. It is nice to see people thinking the same way!

I think that instead of completely disabling local data collection for large stack traces, we could possibly restrict it to the frames just before the error. This would retain some of the most valuable data in the report.
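To illustrate the shape of that idea from the payload side (this sketch assumes pyrollbar's payload-handler hook, rollbar.events.add_payload_handler, and the body.trace/trace_chain frame layout with per-frame locals; note it only shrinks the payload after the fact and does not avoid the collection cost, which would need a change inside pyrollbar):

```python
import rollbar

KEEP_LOCALS_FOR_LAST_N_FRAMES = 10  # hypothetical: only the frames nearest the error

def trim_distant_frame_locals(payload, **kw):
    """Drop collected locals from all frames except those closest to the error."""
    body = payload.get('data', {}).get('body', {})
    traces = body.get('trace_chain') or ([body['trace']] if 'trace' in body else [])
    for trace in traces:
        frames = trace.get('frames', [])
        # Frames are ordered oldest first, so the last N are closest to the error.
        for frame in frames[:-KEEP_LOCALS_FOR_LAST_N_FRAMES]:
            for key in ('locals', 'args', 'kwargs'):
                frame.pop(key, None)
    return payload

rollbar.events.add_payload_handler(trim_distant_frame_locals)
```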

Bailing out of the data collection process once we hit the API limit is a great idea! I love it!

Since applying transforms and running the sanitization logic can change the final size of an object, I don't think we could compare the size of the stack trace to the API limit with complete accuracy. That is, unless we applied them during the data collection phase, as you suggest.

Overall, I think this is the right direction. However, it may take some planning and time to implement. And like you, I can't think of a great way to solve the issue you are facing in the short term. I will give it some more thought, though.

zdavis-rollbar commented 1 month ago

We are working on this, and it's being reviewed.

zdavis-rollbar commented 1 month ago

@noahw3 we just pushed a new alpha version that addresses this issue. Thanks for bringing it to our attention. Please give this new version a try and let us know if it solved your issue.

danielmorell commented 2 weeks ago

Hey @noahw3, have you been able to try the alpha, and do you have any feedback on how it has performed?

noahw3 commented 2 weeks ago

@danielmorell I haven't, sorry. It's still on my radar though, just been busy with other priorities.