logstash-plugins / logstash-filter-grok

Grok plugin to parse unstructured (log) data into something structured.
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
Apache License 2.0

Feature Request: Recursively Grok lines and streams #50

Closed naisanza closed 9 years ago

naisanza commented 9 years ago

I'm looking to use grok to parse through lines and streams of data. I'll explain how.

Let's say I have a line of data of:

221.37.88.36.bc.googleusercontent.com,63.88.73.122,63.88.73.0,-,,-,Google Inc.,Mountain View,CA,US,Google Inc.,Mountain View,CA,US

We can see there is some noticeable information in there, such as IP addresses, hostnames, city, state, and country.

I'm trying to make a grok parser that extracts data out of this line incrementally, where each grok filter removes what it parsed out and feeds the remainder into the next grok filter.

For example:

Let's take an input from a TCP port

input {
   tcp { port => "4382" }
}

And feed it through grok

filter {
    # GROK PARSER 01
    grok {
        match => { "message" => "%{HOSTNAME:Hostname}" }    # This will parse out all hostnames from the line
    }
    # GROK PARSER 02
    grok {
        match => { "message" => "%{IPV4:IP}" }    # This will parse out all IPv4 addresses from the line
    }
    # GROK PARSER 03
    grok {
        match => { "message" => "%{GREEDYDATA:data}" }    # This will capture the rest of the information
    }
}
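One way to approximate this remove-and-continue behavior with existing filters is to strip the matched text between grok stages using mutate's gsub. This is only a sketch: the regex here assumes the hostname is the first comma-delimited field, which may not hold for other inputs.

filter {
    grok {
        match => { "message" => "%{HOSTNAME:Hostname}" }
    }
    # Sketch (assumption): strip the leading field the previous grok
    # captured, so the next grok only sees the remainder of the line
    mutate {
        gsub => [ "message", "^[^,]*", "" ]
    }
    grok {
        match => { "message" => "%{IPV4:IP}" }
    }
}

This gets clumsy quickly, since every stage needs its own hand-written strip pattern, which is part of why a built-in recursive mode would be nicer.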

After GROK PARSER 01, we'll end up parsing out anything that is a hostname

{
            "message" => [
        [0] "221.37.88.36.bc.googleusercontent.com,63.88.73.122,63.88.73.0,-,,-,Google Inc.,Mountain View,CA,US,Google Inc.,Mountain View,CA,US"
    ],
           "@version" => "1",
         "@timestamp" => "2015-07-15T19:33:46.261Z",
           "Hostname" => "221.37.88.36.bc.googleusercontent.com"
}

Then GROK PARSER 02 will parse out any IPv4 addresses

{
            "message" => [
        [0] ",63.88.73.122,63.88.73.0,-,,-,Google Inc.,Mountain View,CA,US,Google Inc.,Mountain View,CA,US"
    ],
           "@version" => "1",
         "@timestamp" => "2015-07-15T19:33:46.261Z",
                 "IP" => [
        [0] "63.88.73.122",
        [1] "63.88.73.0"
    ]
}

And lastly, GROK PARSER 03 will hold what's left

{
            "message" => [
        [0] ",,,-,,-,Google Inc.,Mountain View,CA,US,Google Inc.,Mountain View,CA,US"
    ],
           "@version" => "1",
         "@timestamp" => "2015-07-15T19:33:46.261Z",
               "data" => ",,,-,,-,Google Inc.,Mountain View,CA,US,Google Inc.,Mountain View,CA,US"
}

How can we make this happen?

jordansissel commented 9 years ago

I don't mean to distract from your question, but your format appears to be comma-delimited and is something the csv filter is great at!
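For a fixed format like the sample line, a csv filter could look something like this. The column names below are my own guesses at the fields, not anything from the original data source:

filter {
    csv {
        separator => ","
        # column names are assumptions based on the sample line
        columns => [ "hostname", "ip", "network", "field4", "field5", "field6",
                     "org", "city", "state", "country",
                     "dest_org", "dest_city", "dest_state", "dest_country" ]
    }
}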

naisanza commented 9 years ago

@jordansissel The csv filter is great! But only for one particular CSV format at a time! That dawned on me while I was building a new filter for every new CSV.

naisanza commented 9 years ago

Also, the current grok filter doesn't try to match across the entire line. Once the pattern has matched once, it is fulfilled. So, with the example above, grok will only parse out one IPv4 address, not the second one.

If grok could take the pattern and extract every match throughout the entire log line, not just the first, that's all I need. This is especially useful when the structure of the log is unknown and I'd just like to extract all IPv4 addresses from the line.
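In the meantime, a ruby filter can do the extract-every-match part, since Ruby's String#scan returns all matches rather than stopping at the first. A sketch (the regex is a deliberately loose IPv4 approximation, and the IP field name mirrors the grok example above):

filter {
    ruby {
        # sketch: scan the whole line for every IPv4-shaped token,
        # instead of grok's first-match-only behavior
        code => "
            ips = event.get('message').scan(/\d{1,3}(?:\.\d{1,3}){3}/)
            event.set('IP', ips) unless ips.empty?
        "
    }
}

Note this uses the event.get/event.set API from newer Logstash releases; older releases used event['message'] style access instead.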

jordansissel commented 9 years ago

Closing in favor of tracking this discussion in an older similar ticket, #35