logstash-plugins / logstash-filter-grok

Grok plugin to parse unstructured (log) data into something structured.
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
Apache License 2.0

Grok: Maybe allow defining the type *in* the pattern definition #26

Closed jordansissel closed 9 years ago

jordansissel commented 9 years ago

(This issue was originally filed by @jordansissel at https://github.com/elastic/logstash/issues/1859)


Problem: Many users write %{NUMBER:bytes} in grok and are then confused when Elasticsearch cannot compute statistics or other numeric aggregations on the field. The cause is that grok captures everything as a string by default, so Elasticsearch receives a string and maps 'bytes' as a string field, which is confusing.
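To make the problem concrete, here is a minimal Python sketch (not Logstash code; the regex and the sample log line are stand-ins chosen for illustration). It shows that a named regex capture is always a string, so it serializes as a JSON string unless it is converted explicitly.

```python
import json
import re

# Stand-in for %{NUMBER:bytes}: a named capture group always yields a string.
match = re.search(r"(?P<bytes>\d+)", "GET /index.html 1024")
event = match.groupdict()

# Sent as captured, the value is a JSON string.
as_captured = json.dumps(event)

# Only an explicit cast turns it into a JSON number.
as_converted = json.dumps({"bytes": int(event["bytes"])})

print(as_captured)
print(as_converted)
```

A dynamically mapped index would see a quoted value in the first document and a bare number in the second, which is the difference between a string field and a numeric one.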

I'm tired of users tripping over this problem. I would be willing to add a feature to grok that allowed you to define the 'type' of a pattern inside the pattern definition.

Background: In a grok patterns file, you can define a pattern with NAME PATTERN syntax (name of pattern, space, the regexp pattern).

Proposal: Allow the type to accompany the NAME.

By way of example, if we were to fix this NUMBER problem permanently, we would define the new pattern like this:

NUMBER:float (%{BASE10NUM})

The new syntax is NAME:TYPE REGEXP, and it is backwards-compatible with the old syntax (the :TYPE is optional and defaults to string if not provided).
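A rough Python sketch of how a patterns-file line in the proposed syntax could be parsed. This is only an illustration of the idea, not grok's implementation: the `parse_pattern_definition` helper, the `PatternDef` tuple, and the set of type names in `CASTS` are all assumptions (the proposal itself only mentions string and float).

```python
from typing import NamedTuple

# Assumed set of types; the proposal only names string (default) and float.
CASTS = {"string": str, "int": int, "float": float}


class PatternDef(NamedTuple):
    name: str
    type: str
    regexp: str


def parse_pattern_definition(line: str) -> PatternDef:
    """Split 'NAME[:TYPE] REGEXP' on the first run of whitespace."""
    head, regexp = line.strip().split(None, 1)
    # Only the NAME part is inspected, so colons inside the regexp are safe.
    name, _, type_name = head.partition(":")
    type_name = type_name or "string"  # old syntax: no :TYPE means string
    if type_name not in CASTS:
        raise ValueError(f"unknown type {type_name!r} for pattern {name}")
    return PatternDef(name, type_name, regexp)


new_style = parse_pattern_definition("NUMBER:float (%{BASE10NUM})")
old_style = parse_pattern_definition("NUMBER (?:%{BASE10NUM})")

# Apply the declared type to a captured string.
value = CASTS[new_style.type]("1024")
```

Here `old_style.type` falls back to "string", which is what keeps existing patterns files working, and `value` comes out as the float 1024.0 even though the input had no fractional part.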

This would let us define patterns together with their types, so that the following is captured as a numeric type in Elasticsearch: %{NUMBER:bytes}

It's not clear if this will solve everything, though. In some cases, like 'bytes', the value is never fractional, so users doing %{NUMBER:bytes} and seeing a float may be confused because they wanted a long type in Elasticsearch.

Thoughts?

avleen commented 9 years ago

We solved this a different way internally, by using notation on the Elasticsearch mapping. We have the following dynamic field mapping:

"dynamic_templates": [
    {
        "short_template" : {
            "match" : "s_*",
            "mapping" : { "type" : "short", "index" : "not_analyzed" }
        }
    },
    {
        "long_template" : {
            "match" : "l_*",
            "mapping" : { "type" : "long", "index" : "not_analyzed" }
        }
    },
    {
        "double_template" : {
            "match" : "d_*",
            "mapping" : { "type" : "double", "index" : "not_analyzed" }
        }
    },
    {
        "bool_template" : {
            "match" : "b_*",
            "mapping" : { "type" : "boolean", "index" : "not_analyzed" }
        }
    },
    {
        "string_template" : {
            "match" : "*",
            "mapping" : { "type" : "string", "index" : "not_analyzed" },
            "match_mapping_type" : "string"
        }
    }
],

If a field name starts with l_, Elasticsearch maps it to a long. If it starts with b_, it's a boolean. It requires more work from the end user to make it happen, and it requires people to rename their fields. But it's performant and reliable.
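For anyone trying to reason about which template a given field will hit, here is a small Python sketch that mimics the lookup. It is a simplification under stated assumptions: it relies on the first matching template winning, it approximates the "match" wildcard with `fnmatchcase`, and the `mapped_type` helper and its `detected` parameter (the JSON type Elasticsearch would detect for the value) are made up for illustration.

```python
from fnmatch import fnmatchcase
from typing import Optional

# (field-name glob, match_mapping_type or None, mapped type),
# in the same order as the dynamic_templates above.
TEMPLATES = [
    ("s_*", None, "short"),
    ("l_*", None, "long"),
    ("d_*", None, "double"),
    ("b_*", None, "boolean"),
    ("*", "string", "string"),
]


def mapped_type(field_name: str, detected: str = "string") -> Optional[str]:
    """Return the type of the first matching template, or None if none match."""
    for glob, required_detected, es_type in TEMPLATES:
        if not fnmatchcase(field_name, glob):
            continue
        if required_detected is not None and required_detected != detected:
            continue
        return es_type
    # No template matched: the default dynamic mapping would apply instead.
    return None


for field in ("l_bytes", "b_cached", "message"):
    print(field, "->", mapped_type(field))
```

Note that a field named plain "bytes" does not match "b_*" because the prefix includes the underscore, so only renamed fields pick up the typed templates.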

jordansissel commented 9 years ago

@avleen +1 on your solution, at least for now.

I'm content to close this. We can revisit or continue the discussion any time later :)

q2dg commented 3 years ago

What about "ip_*" notation for IPv4 fields? Thanks!