opensearch-project / data-prepper

OpenSearch Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

Extract values from Grok with the correct type #2009

Open dlvenable opened 2 years ago

dlvenable commented 2 years ago

Is your feature request related to a problem? Please describe.

The grok processor currently creates all Event values as strings. For example, when grokking on an Apache HTTP log, all response values are strings. This prevents a pipeline author from creating conditional routing expressions which perform comparisons such as /response < 500.
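To make the problem concrete, here is a minimal sketch of a pipeline that runs into it. The pipeline name, the http source, the stdout sink, and the route name are illustrative assumptions; the only point is that COMMONAPACHELOG captures response as a string, so the route's numeric comparison cannot work as intended.

```yaml
# Hypothetical pipeline; the names and the source/sink choices are assumptions.
apache-log-pipeline:
  source:
    http:
  processor:
    - grok:
        match:
          # COMMONAPACHELOG captures "response" and "bytes", both as strings.
          log: ["%{COMMONAPACHELOG}"]
  route:
    # Meant as a numeric comparison, but /response holds a string such as "200".
    - non-server-errors: '/response < 500'
  sink:
    - stdout:
        routes:
          - non-server-errors
```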

Describe the solution you'd like

The grok processor can have two options to help pipeline authors.

  1. Manual configuration of pattern types.
  2. Automatic conversion of pattern types for pre-defined patterns.

Manual configuration

Provide a configuration that allows the grok processor to convert specific patterns. This new configuration - conversions - would take a map of patterns to destination types.

For example:

grok:
  conversions:
    INT: integer
    NUMBER: decimal
    MY_CUSTOM_NUMBER: integer

Automatic configuration

Provide a setting that allows the grok processor to automatically convert specific patterns which it has pre-included. The grok processor has some default patterns like INT. Most pipeline authors probably want these to automatically get the correct type. The grok processor can automatically convert these known patterns.

This would be a change of behavior. So, I propose that this setting be disabled by default, but in a future major version we would enable it.

Thus, to use it in Data Prepper 2.0, a pipeline author would set:

grok:
  disable_automatic_conversion: false

Then, perhaps in Data Prepper 3.0, the default value becomes false, so pipeline authors no longer have to specify it.

Describe alternatives you've considered (Optional)

Ask pipeline authors to use a casting processor as requested in #2010. The solution using grok can be easier for pipeline authors, especially with an automatic conversion.

kkondaka commented 2 years ago

I just realized that this functionality already exists. The grok processor can handle patterns like %{NUMBER:bytes:int}, which generates the field "bytes" with an integer value. If ":int" is omitted, the value is a string.
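As a rough sketch of how that could look in a pipeline, the fragment below spells out an Apache-style pattern by hand so that the type suffix can be attached to individual captures. The pattern line and the route name are illustrative assumptions, not a tested configuration; only the %{NUMBER:bytes:int} syntax comes from the comment above.

```yaml
# Hypothetical fragment; the pattern and route name are assumptions for illustration.
processor:
  - grok:
      match:
        log:
          # ":int" asks grok to emit the capture as an integer instead of a string.
          - '%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:verb} %{NOTSPACE:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:response:int} %{NUMBER:bytes:int}'
route:
  # With /response typed as an integer, this comparison is numeric.
  - non-server-errors: '/response < 500'
```

The trade-off is that a composite pattern such as COMMONAPACHELOG has to be expanded by hand to add the suffix, which is part of what the automatic conversion proposed in this issue would avoid.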

dlvenable commented 11 months ago

Somewhat similar issue: #3918