logstash-plugins / logstash-filter-kv

KV Filter Generates Pathological Regex for Simple Config #77

Closed: original-brownbear closed this issue 5 years ago

original-brownbear commented 5 years ago

The following config

    kv {
      source => "PayloadParams"
      value_split => "="
      allow_duplicate_values => false
      target => "[powershell][param]"
      include_keys => [ "name", "value" ]
    }

results in

{:regex=>"/(?-mix:((?:\\\\.|[^= ])+))(?-mix:(?-mix: *)(?-mix:[=])(?-mix: *))(?-mix:(?-mix:(?-mix:\")(?-mix:((?:\\\\.|[^\"])+))?(?-mix:\"))|(?-mix:(?-mix:')(?-mix:((?:\\\\.|[^'])+))?(?-mix:'))|(?-mix:(?-mix:\\()(?-mix:((?:\\\\.|[^\\)])+))?(?-mix:\\)))|(?-mix:(?-mix:\\[)(?-mix:((?:\\\\.|[^\\]])+))?(?-mix:\\]))|(?-mix:(?-mix:<)(?-mix:((?:\\\\.|[^>])+))?(?-mix:>))|(?-mix:((?:\\\\.|[^ ])+)))?(?-mix:(?-mix:[ ])|(?-mix:$))/"}

being generated.

This regex appears to explode at runtime for longer strings (as in locking up in a hot loop for minutes on an O(1k)-length string) because of its overlapping sections. I think that if the split pattern is a constant string like = rather than a regex, there should be a way to avoid introducing overlapping matching sections here and get linear performance.

yaauie commented 5 years ago

@original-brownbear would you be able to share specific inputs that are pathological against the generated regexp?

A lot of the complexity in that pattern comes from supporting arbitrary whitespace around the value-split (which can now be disabled with whitespace => strict) and bracket-wrapped values (which can be disabled with include_brackets => false).
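
For reference, a minimal sketch combining the original config from this issue with both of those options (illustrative only, not taken from the support case):

    kv {
      source => "PayloadParams"
      value_split => "="
      allow_duplicate_values => false
      target => "[powershell][param]"
      include_keys => [ "name", "value" ]
      whitespace => "strict"      # do not tolerate whitespace around the value-split
      include_brackets => false   # do not treat (), [], <> as value wrappers
    }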

original-brownbear commented 5 years ago

@yaauie unfortunately, I don't have the data that triggered this (it came out of a support request). Back then I recommended the two things you suggested (whitespace and brackets), and that helped the situation. Still, I'm wondering whether we could improve this somehow by either:

Acting on this:

I think that if the split pattern is a constant string like = rather than a regex, there should be a way to avoid introducing overlapping matching sections here and get linear performance.

or, easier and more generally applicable:

At least, introduce a timeout like we have in Grok? In the specific report that led to this issue, it was literally one out of 200k events that caused the whole pipeline to stall for a few minutes until it was processed (and I'd expect the situation where only a small fraction of inputs triggers this, rather than most inputs, to be pretty common), so a timeout would likely be a "good enough" solution for those cases.

yaauie commented 5 years ago

I think that if the split pattern is a constant string like = rather than a regex, there should be a way to avoid introducing overlapping matching sections here and get linear performance.

Supporting backslash-escapes throws a bit of a wrench into this, and although in theory I can see how inputs containing backslash-escapes could be made to backtrack, I've been unable to manually create pathological inputs against configurations that use single-character separators.

At least, introduce a timeout like we have in Grok?

I have introduced a PR for one in logstash-plugins/logstash-filter-kv#79.

Hopefully it will also help us to get a better understanding of the pathological inputs, since with timeouts they will make it to the outputs nicely tagged.
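
As a sketch of what that would look like in a pipeline config (the option names timeout_millis and tag_on_timeout are an assumption here, based on the current kv filter docs rather than quoted from the PR):

    kv {
      source => "PayloadParams"
      value_split => "="
      target => "[powershell][param]"
      timeout_millis => 30000                  # give up matching an event after 30s
      tag_on_timeout => "_kv_filter_timeout"   # tag events that hit the timeout
    }

Events that hit the timeout then pass through tagged rather than stalling the pipeline, which should make the pathological inputs easy to spot downstream.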

colinsurprenant commented 5 years ago

Note that we recently discovered that the timeouts in both kv and grok simply did not work; see https://github.com/elastic/logstash/pull/10978. Now, with the working timeout enforcer, the default 30s timeout should be honoured, which will help diagnose runaway regexes.

I will go ahead and close this issue, feel free to reopen if you think we could do more here.