logstash-plugins / logstash-filter-grok

Grok plugin to parse unstructured (log) data into something structured.
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
Apache License 2.0

Significant Performance Regression using Jruby Timeouts #152

Open · yaauie opened this issue 4 years ago

yaauie commented 4 years ago

I've run a number of performance tests to measure raw per-worker throughput and determine how different variables affect grok performance. The changes to timeouts clearly have an outsized impact on grok filters that fall through many patterns before finding their match, which makes sense: the plugin initiates a timeout on each event for each pattern attempted until a match is found.

| Test (per-worker throughput) | 6.7.1 | 7.4.1 JEE | 7.4.1 REE |
|---|---|---|---|
| grok-single-pattern | 97k | 65k | 57k |
| grok-single-pattern-NT | 96k | 105k | 98k |
| grok-52-fallthrough | 42k | 15k | 17k |
| grok-52-fallthrough-NT | 42k | 71k | 68k |

_Caveat: in the above, NT indicates "No Timeout" (i.e. timeout_millis => 0), which we do NOT recommend for production, since a single pathological input can tie up CPU resources indefinitely._
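For illustration only (not the actual benchmark configuration), a fall-through grok of the kind measured above lists several patterns for one field; grok tries them in order, and the per-attempt timeout is armed for each failed attempt. Field names and patterns here are made up:

```
filter {
  grok {
    # Patterns are tried top to bottom until one matches; with the
    # per-pattern timeout, every failed attempt pays the timeout setup cost.
    match => {
      "message" => [
        "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}",
        "%{SYSLOGTIMESTAMP:ts} %{SYSLOGHOST:host} %{GREEDYDATA:msg}",
        "%{COMBINEDAPACHELOG}"
      ]
    }
    # timeout_millis => 0   # disables the timeout entirely (NOT recommended)
  }
}
```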

I will follow up with methodology for the above, and will see if I can get a better apples-to-apples comparison with 6.7.1 using the same version of this plugin as ships with 7.x.

colinsurprenant commented 4 years ago

@yaauie interesting. In #147 I did not see such a dramatic performance regression for the single-pattern use case; I also did not test multiple fall-through patterns which, as you point out, will multiply any performance regression. I wonder whether using different LS versions also affects the performance tests. Did you use the same LS version for all tests with the different plugin versions?

I wonder if it would make sense to have a single global timeout for the whole fall-through list? (It could be made configurable to keep BWC.)

colinsurprenant commented 4 years ago

Meta issue created in elastic/logstash#11302

kares commented 4 years ago

With https://github.com/logstash-plugins/logstash-filter-grok/pull/153 merged and available since >= 4.2.0, the fall-through case should be better when using timeout_scope => "event". For the single-pattern case the reported numbers still look like a considerable regression, but we have yet to see how much that matters in real-world usage.
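For reference, timeout_scope is set per grok filter; a minimal sketch (patterns are illustrative, timeout value is the default) of switching a multi-pattern grok to one timeout per event rather than per attempted pattern:

```
filter {
  grok {
    match => {
      "message" => [
        "%{TIMESTAMP_ISO8601:ts} %{GREEDYDATA:msg}",
        "%{COMBINEDAPACHELOG}"
      ]
    }
    timeout_millis => 30000
    # One timeout covers all pattern attempts for the event (>= 4.2.0),
    # instead of re-arming the timeout for each pattern.
    timeout_scope => "event"
  }
}
```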

The options we have left (if more performance is needed) are the following:

percygrunwald commented 4 years ago

We have run into this performance regression after upgrading from 6.7.1 to 7.4.2. With the same configuration, the update to 7.4.2 approximately doubled our event latency and dropped our events emitted by around 40%: [chart: event latency and events emitted, before vs. after upgrade]

There was no way to improve the performance of our pipelines after the upgrade by changing the JVM heap size, pipeline workers, or batch size. The drop in throughput seems to be due in large part to a bottleneck in the grok filter. With the same settings as before the upgrade, we were able to recover some of our pre-upgrade throughput by adding timeout_millis => 0 to all grok invocations in our pipeline definitions: [chart: throughput after adding timeout_millis => 0]

We tested timeout_scope => "event" (instead of timeout_millis => 0) on some other machines and it had a very slight positive impact on overall performance: [chart: throughput with timeout_scope => "event"]
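To make the mitigation concrete, a hedged sketch of what repeating the setting looks like in a pipeline definition (field names and patterns are placeholders); the option has to be added to every grok block separately:

```
filter {
  grok {
    match => { "message" => "%{SYSLOGLINE}" }
    timeout_millis => 0   # disables the regex timeout for this filter only
  }
  grok {
    match => { "request" => "%{WORD:method} %{URIPATH:path}" }
    timeout_millis => 0   # must be repeated in each grok invocation
  }
}
```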

kares commented 4 years ago

Hey Percy, thanks for the data; it is helpful to know timeout_scope => event did not help much in your case. We are considering options for tackling this without reverting to the ('problematic' but fast) previous timeout implementation. It will take us a while to have something shippable, and we might need folks to do a 'beta' test when it is ready (real-world scenarios such as yours reflect performance changes more accurately).

percygrunwald commented 4 years ago

@kares thanks for the follow-up, we appreciate your work on this. Our setup is holding steady for now with just the addition of timeout_millis => 0 to all our grok filters, but we would prefer a more permanent solution that doesn't require adding this to each invocation of the filter.

abdalians commented 4 years ago

+1 on timeout_millis => 0

percygrunwald commented 3 years ago

@kares has there been any progress on this issue? We still have a code base full of timeout_millis => 0 as a mitigation for this.