logstash-plugins / logstash-filter-grok

Grok plugin to parse unstructured (log) data into something structured.
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
Apache License 2.0

Passing an array of patterns to grok does not match the second pattern (of two). #43

Closed jarpy closed 9 years ago

jarpy commented 9 years ago

The new hash syntax for 'match' can take an array of patterns, but if two patterns are provided, the second is not matched.

Here is a shell "one-liner" demonstrating the negative case:

echo 'banana 7' | /opt/logstash/bin/logstash -e '
input {
  stdin { }
}

filter {
  grok {
    match => {
      "message" => [
        "%{WORD:word}",
        "%{POSINT:number}"
      ]
    }
  }
}

output {
  stdout {
    codec => rubydebug
  }
}
'
{
       "message" => "banana 7",
      "@version" => "1",
    "@timestamp" => "2015-06-09T04:28:55.602Z",
          "host" => "metrics.localdomain",
          "word" => "banana"
}

...and one for the positive case, using a different syntax:

echo 'banana 7' | /opt/logstash/bin/logstash -e '
input {
  stdin { }
}

filter {

  grok {
    match => {
      "message" => "%{WORD:word}"
    }
  }

  grok {
    match => {
      "message" => "%{POSINT:number}"
    }
  }
}

output {
  stdout {
    codec => rubydebug
  }
}
'
{
       "message" => "banana 7",
      "@version" => "1",
    "@timestamp" => "2015-06-09T04:30:11.408Z",
          "host" => "metrics.localdomain",
          "word" => "banana",
        "number" => "7"
}
jarpy commented 9 years ago

Working on a patch.

This commit demonstrates a test and a (brutal, incorrect) patch that makes it pass. 417d993d7a964a9650030290e02213a63a16a4b2

Essentially, the default value of break_on_match = true makes grok bail out after the first array element is evaluated.
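Given that diagnosis, the original one-liner can be made to capture both fields without a patch by overriding the default. This is a minimal sketch of the relevant filter section, assuming the documented break_on_match option:

```
filter {
  grok {
    match => {
      "message" => [
        "%{WORD:word}",
        "%{POSINT:number}"
      ]
    }
    # With break_on_match => false, grok keeps evaluating the
    # remaining patterns in the array even after the first one
    # matches, so both "word" and "number" get captured.
    break_on_match => false
  }
}
```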

jarpy commented 9 years ago

Is break_on_match intended to be mainly a performance optimization? If so, should it default to false, so as not to be a premature one?

jordansissel commented 9 years ago

It wasn't chosen as a performance optimization. It was chosen because it is common to have applications with hundreds or thousands of unique log entry formats (Elasticsearch has over 1300 of them). Stopping at the first match lets you write specific patterns to parse the messages you care most about, and have a fall-back case (or several fall-back attempts) for the rest.

The same applies to things like syslog-ish formatted messages. For one message you may care about the details:

Aug  7 14:10:27 crinkle sshd[10094]: Invalid user foo from ::1

And maybe parse this specially to indicate which user failed to log in. Other syslog messages you may not have seen yet, or may not care to parse further, but you would still want at least the common header (timestamp, host, app) parsed.
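That specific-pattern-first, fall-back-last ordering is exactly what the break_on_match default supports. A rough sketch, assuming the stock SYSLOGBASE and USERNAME grok patterns (field names invalid_user and syslog_message are illustrative, not from this thread):

```
filter {
  grok {
    match => {
      "message" => [
        # Specific pattern first: pulls out the rejected username
        # from sshd "Invalid user" lines.
        "%{SYSLOGBASE} Invalid user %{USERNAME:invalid_user} from %{GREEDYDATA:src}",
        # Fall-back: any other syslog line still gets its common
        # header (timestamp, host, program) parsed.
        "%{SYSLOGBASE} %{GREEDYDATA:syslog_message}"
      ]
    }
    # break_on_match defaults to true: evaluation stops at the
    # first pattern that matches, so the fall-back only runs when
    # the specific pattern fails.
  }
}
```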

jarpy commented 9 years ago

Thanks Jordan. I get it now.

Appreciate you taking the time.