logstash-plugins / logstash-filter-grok

Grok plugin to parse unstructured (log) data into something structured.
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
Apache License 2.0

Max grok pattern size #69

Open · stdweird opened this issue 8 years ago

stdweird commented 8 years ago

We have a Logstash grok filter with a single match => { message => '%{PATTERN}' }, where PATTERN is made up of several other patterns joined with | (i.e. a custom patterns file containing PATTERN %{PAT1}|%{PAT2}, where each sub-pattern is itself a combination of further patterns).

Recently, we added a new pattern to the joined list, and Logstash started to consume large amounts of CPU after a while (roughly 30 minutes; by then it had already parsed a few messages with the new pattern, so the new pattern itself seems fine).

But maybe we hit some internal threshold/buffer size/... Is there a limit to the size of a single pattern in the match => message? We could split the patterns and use match => { message => ['PAT1', 'PAT2', ...] } instead (both forms are sketched below), but would that improve anything?
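For illustration, a minimal sketch of both forms, assuming a hypothetical custom patterns file and patterns_dir path; PAT1/PAT2/PATTERN are the placeholder names from above, and their bodies here are invented stand-ins:

    # Hypothetical patterns file, e.g. /etc/logstash/patterns/extra
    PAT1 %{SYSLOGBASE} %{GREEDYDATA:msg}
    PAT2 %{TIMESTAMP_ISO8601:ts} %{GREEDYDATA:msg}
    PATTERN %{PAT1}|%{PAT2}

    # Single joined pattern:
    filter {
      grok {
        patterns_dir => ["/etc/logstash/patterns"]
        match => { "message" => "%{PATTERN}" }
      }
    }

    # Split into an array; grok tries each entry in order and stops at the first match:
    filter {
      grok {
        patterns_dir => ["/etc/logstash/patterns"]
        match => { "message" => [ "%{PAT1}", "%{PAT2}" ] }
      }
    }

Functionally the two are close: the array form compiles each pattern separately and tries them one at a time instead of matching one large alternation.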

I also found #37, but I don't think it's related. The pattern does have a GREEDYDATA at the end, but because it is at the end, I don't think it should matter (the new pattern looks like uid:%{INT:uid:int} sid:%{INT:sid:int} tty:%{DATA:tty} cwd:%{UNIXPATH:cwd} filename:%{UNIXPATH:executable}: %{GREEDYDATA:command}).

stdweird commented 8 years ago

For future reference: the bug is that UNIXPATH does not accept + in a path (note the ++ segment in the cwd below), and the above grok pattern makes Logstash hang on the following input data:

uid:1234 sid:5678 tty:(none) cwd:/som/path/long/long/long/++/mor/long/path/anonymized/file filename:/bin/cut: cut -d. -f2

The same issue occurs if, for example, the cwd contains a space or similar. Switching from UNIXPATH to DATA fixed the issue.
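A minimal sketch of the reworked match, assuming both UNIXPATH captures (cwd and executable) are swapped for DATA, with the field names kept from the original pattern:

    filter {
      grok {
        match => {
          "message" => "uid:%{INT:uid:int} sid:%{INT:sid:int} tty:%{DATA:tty} cwd:%{DATA:cwd} filename:%{DATA:executable}: %{GREEDYDATA:command}"
        }
      }
    }

With DATA, a cwd or filename value that UNIXPATH cannot match (such as the ++ segment above) no longer sends the regex engine into heavy backtracking; the capture simply extends up to the next literal delimiter in the pattern.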

Tip for the Logstash people: it took me half a day to isolate the offending message, and 20 minutes to figure out what was wrong with the pattern.