logstash-plugins / logstash-filter-grok

Grok plugin to parse unstructured (log) data into something structured.
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
Apache License 2.0

Max grok pattern size #69

Open · stdweird opened this issue 8 years ago

stdweird commented 8 years ago

We have a Logstash grok filter with a single match => { message => '%{PATTERN}' }, where PATTERN is made up of several other patterns joined with | (i.e. a custom patterns file containing PATTERN %{PAT1}|%{PAT2}, where each sub-pattern is itself a combination of further patterns).

Recently, we added a new pattern to the joined list, and Logstash started to consume large amounts of CPU after a while (roughly 30 minutes; by then it had already parsed a few messages with the new pattern, so the new pattern itself seems fine).

But maybe we hit some internal threshold/buffer size/... Is there a limit to the size of a single pattern in the match => message? We could split the patterns and use match => { message => ['PAT1', 'PAT2', ...] } instead (both forms are sketched below), but would that improve anything?
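For illustration, a minimal sketch of both forms, assuming a hypothetical custom patterns file and patterns_dir path; PAT1/PAT2/PATTERN are the placeholder names from above, and their bodies here are invented stand-ins:

    # Hypothetical patterns file, e.g. /etc/logstash/patterns/extra
    PAT1 %{SYSLOGBASE} %{GREEDYDATA:msg}
    PAT2 %{TIMESTAMP_ISO8601:ts} %{GREEDYDATA:msg}
    PATTERN %{PAT1}|%{PAT2}

    # Single joined pattern:
    filter {
      grok {
        patterns_dir => ["/etc/logstash/patterns"]
        match => { "message" => "%{PATTERN}" }
      }
    }

    # Split into an array; grok tries each entry in order and stops at the first match:
    filter {
      grok {
        patterns_dir => ["/etc/logstash/patterns"]
        match => { "message" => [ "%{PAT1}", "%{PAT2}" ] }
      }
    }

Functionally the two are close: the array form compiles each pattern separately and tries them one at a time instead of matching one large alternation.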

I also found #37, but I don't think it's related. The pattern does have a GREEDYDATA at the end, but because it is at the end, I don't think it should matter (the new pattern looks like uid:%{INT:uid:int} sid:%{INT:sid:int} tty:%{DATA:tty} cwd:%{UNIXPATH:cwd} filename:%{UNIXPATH:executable}: %{GREEDYDATA:command}).

stdweird commented 8 years ago

For future reference: the bug is that UNIXPATH does not accept + in a path (note the ++ segment in the cwd below), and the above grok pattern makes Logstash hang on the following input data:

uid:1234 sid:5678 tty:(none) cwd:/som/path/long/long/long/++/mor/long/path/anonymized/file filename:/bin/cut: cut -d. -f2

The same issue occurs if, for example, the cwd contains a space or similar. Switching from UNIXPATH to DATA fixed the issue.
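A minimal sketch of the reworked match, assuming both UNIXPATH captures (cwd and executable) are swapped for DATA, with the field names kept from the original pattern:

    filter {
      grok {
        match => {
          "message" => "uid:%{INT:uid:int} sid:%{INT:sid:int} tty:%{DATA:tty} cwd:%{DATA:cwd} filename:%{DATA:executable}: %{GREEDYDATA:command}"
        }
      }
    }

With DATA, a cwd or filename value that UNIXPATH cannot match (such as the ++ segment above) no longer sends the regex engine into heavy backtracking; the capture simply extends up to the next literal delimiter in the pattern.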

Tip for the Logstash people: it took me half a day to isolate the offending message, and 20 minutes to figure out what was wrong with the pattern.