logstash-plugins / logstash-filter-dissect

Extract structured fields from an unstructured line
Apache License 2.0

Add support for multiple dissect patterns on a field like Grok #56

Open guyboertje opened 6 years ago

guyboertje commented 6 years ago

There are a few drivers for this.

People are familiar with this from Grok. Beats and Ingest Node would like to support Dissect-style de-structuring. The Grok classifier in ML would like to support it. It would simplify some configs, see this for more info:

dissect {
  mapping => {
    "message" => "%{} %{message}"
  }
}
if [message] =~ /^id/ {
  dissect {
    mapping => {
      "message" => "id=%{imp_id} action=%{imp_action} wf=%{imp_wf} ip=%{imp_ip} from=%{imp_from} to=%{imp_to} %{message}"
    }
  }
  if [message] =~ /size/ {
    dissect {
      mapping => {
        "message" => "size=%{imp_size} filters=%{imp_filters}"
      }
    }
  } else if [message] =~ /filters/ {
    dissect {
      mapping => {
        "message" => "filters=%{imp_filters}"
      }
    }
  }
} else if [message] =~ /^sid/ {
  dissect {
    mapping => {
      "message" => "sid=%{imp_sid} ip=%{imp_ip} action=%{imp_action} wf=%{imp_wf} smpt=%{imp_smtp} %{message}"
    }
  }
}

To (suggestion):

dissect {
  break_on_match => false
  # cascading mutation of message field
  mapping => {
    "message" => [
      "%{} %{message}",
      "sid=%{imp_sid} ip=%{imp_ip} action=%{imp_action} wf=%{imp_wf} smpt=%{imp_smtp} %{message}",
      "id=%{imp_id} action=%{imp_action} wf=%{imp_wf} ip=%{imp_ip} from=%{imp_from} to=%{imp_to} %{message}",
      "size=%{imp_size} filters=%{imp_filters}",
      "filters=%{imp_filters}"
    ]
  }
}

cdahlqvist commented 6 years ago

I am not sure that is a very common example. In that case I would use dissect to parse out the full kv list and then apply the kv filter to it.

I think a more common use case is when you have a log file with a number of different log line formats in it and you want to try these against a list of dissect patterns in sequence and break when a match is found, similar to how grok works.

I picked some sample PAM logs from https://ossec-docs.readthedocs.io/en/latest/log_samples/auth/pam.html

Jul  7 10:51:24 srbarriga su(pam_unix)[14592]: session opened for user test2 by (uid=10101)
Jul  7 10:53:07 srbarriga su(pam_unix)[14592]: session closed for user test
Jul  7 10:55:56 srbarriga sshd(pam_unix)[16660]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=192.168.20.111  user=root

If I wanted to parse these using dissect, I would have to deal with the slight variations in format between them. If multiple patterns were allowed and matching stopped after the first success, something like this could work:

dissect {
  break_on_match => true
  mapping => {
    "message" => [
      "%{ts->} %{+ts} %{+ts} %{host} %{command}(pam_unix)[%{pid}]: %{action} %{+action} for user %{user} by (uid=%{uid})%{}",
      "%{ts->} %{+ts} %{+ts} %{host} %{command}(pam_unix)[%{pid}]: %{action} %{+action} for user %{user}",
      "%{ts->} %{+ts} %{+ts} %{host} %{command}(pam_unix)[%{pid}]: %{action} %{+action}; %{params}"
    ]
  }
}

Maybe this scenario could be handled by the cascading as well?
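
To make the two modes concrete, here is a minimal Python sketch of trying several patterns against one field, either stopping at the first match or cascading through all of them. It is not the plugin's implementation: `dissect` and `apply_patterns` are hypothetical helpers, the pattern is translated to a regular expression purely for brevity, and joining append fields (`%{+name}`) with a single space is an assumption of this sketch.

```python
import re

FIELD = re.compile(r"%\{([^}]*)\}")

def dissect(pattern, text):
    """Return the captured fields, or None when the pattern does not fit."""
    regex, names, pos, padded = "", [], 0, False
    for m in FIELD.finditer(pattern):
        literal = re.escape(pattern[pos:m.start()])
        # A "->" suffix on the previous field lets the delimiter after it repeat.
        regex += f"(?:{literal})+" if padded and literal else literal
        name = m.group(1)
        padded = name.endswith("->")
        names.append(name[:-2] if padded else name)
        regex += "(.*?)"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    match = re.fullmatch(regex, text, re.DOTALL)
    if match is None:
        return None
    fields = {}
    for name, value in zip(names, match.groups()):
        if name == "":
            continue  # %{} is a skip field: matched but not stored
        if name.startswith("+"):
            key = name[1:]  # append field, joined with a space (assumption)
            fields[key] = f"{fields[key]} {value}" if key in fields else value
        else:
            fields[name] = value
    return fields

def apply_patterns(event, source, patterns, break_on_match=True):
    """Try each pattern against event[source]; return indexes that matched."""
    matched = []
    for index, pattern in enumerate(patterns):
        fields = dissect(pattern, event.get(source, ""))
        if fields is None:
            continue
        event.update(fields)  # may rewrite the source field (cascading)
        matched.append(index)
        if break_on_match:
            break
    return matched

PREFIX = "%{ts->} %{+ts} %{+ts} %{host} %{command}(pam_unix)[%{pid}]: "
PAM_PATTERNS = [
    PREFIX + "%{action} %{+action} for user %{user} by (uid=%{uid})%{}",
    PREFIX + "%{action} %{+action} for user %{user}",
    PREFIX + "%{action} %{+action}; %{params}",
]

event = {"message": "Jul  7 10:53:07 srbarriga su(pam_unix)[14592]: session closed for user test"}
print(apply_patterns(event, "message", PAM_PATTERNS), event["user"])
```

With `break_on_match=True` the most specific pattern has to come first, because the shorter "for user" pattern would also accept the longer line and swallow the trailing "by (uid=...)" into `user`. With `break_on_match=False` every pattern is tried in turn, and a pattern that captures into `message` changes what the next pattern sees.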

ph commented 6 years ago

I think the best way to implement it as @guyboertje proposed is to add a new sequence option to the dissect filter that will support multiple definitions of dissect/mapping in an array instead of a hash.

This would reflect the behavior of defining multiple dissect plugins in the configuration and will be backward compatible.

dissect {
  sequence => [
    {
      break_on_match => false,
      field => "message",
      tokenizer => [
        "%{ts->} %{+ts} %{+ts} %{host} %{rest}"
      ]
    },
    {
      break_on_match => true,
      field => "rest",
      tokenizer => [
        "%{command}(pam_unix)[%{pid}]: %{action} %{+action} for user %{user} by (uid=%{uid})%{}",
        "%{command}(pam_unix)[%{pid}]: %{action} %{+action} for user %{user}",
        "%{command}(pam_unix)[%{pid}]: %{action} %{+action}; %{params}"
      ]
    }
  ]
}

guyboertje commented 6 years ago

As discussed with @ph, this is a variation on using sequence but is less cryptic. It also adds target and clarifies the difference between breaking out of the patterns (tokeniser) and breaking out of the sequence, plus tags to trace which patterns in the sequence matched.

dissect {

  # Jul  7 10:52:14 srbarriga sshd(pam_unix)[17365]: session opened for user test by (uid=508)
  # Nov 17 21:41:22 localhost su[8060]: (pam_unix) session opened for user root by (uid=0)
  # Nov 11 22:46:29 localhost vsftpd: pam_unix(vsftpd:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=1.2.3.4

  target => "captured_fields"
  sequence => [
    {
      source => "message"
      target => "inner_fields"
      patterns => [
        # always breaks on match of a pattern, but continues with sequence unless stopped
        {
          pattern => "%{ts->} %{+ts} %{+ts} %{host} %{message}"
          tags => ["pam_format_common"]
          stop_sequence_on_match => false # default
        }
      ]
    },
    {
      source => "[inner_fields][message]"
      patterns => [
        {
          pattern => "%{command}(pam_unix)[%{pid}]: %{rest}"
          tags => ["pam_format_1"]
          stop_sequence_on_match => true
        },
        {
          pattern => "%{command}[%{pid}]: (pam_unix) %{message}"
          tags => ["pam_format_2"]
          stop_sequence_on_match => false
        },
        {
          pattern => "%{command}: pam_unix(%{process_name}): %{message}"
          tags => ["pam_format_3"]
          stop_sequence_on_match => false
        }
      ]
    },
    {
      source => "[captured_fields][message]"
      patterns => [
        {
          pattern => "%{action} %{+action} for user %{user} by (uid=%{uid})%{}"
          tags => ["pam_format_for_user"]
          stop_sequence_on_match => false
        },
        {
          pattern => "%{action}; %{kv_params}"
          tags => ["pam_format_kv"]
          stop_sequence_on_match => false
        }
      ]
    }
  ]
}

guyboertje commented 6 years ago

@ph and I decided that we should explicitly support the idea that a pattern is "anchored" to the start of the field value. A pattern of: "---BEGIN---%{field1} %{field2}" should match a value string of: "---BEGIN---foo bar" and should NOT match a value string of: "some preamble ---BEGIN---foo bar".

A leading skip field should be used if there is any chance that a value string can have some unknown content before the known ---BEGIN--- delimiter, for example: "%{}---BEGIN---%{field1} %{field2}".
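
The anchoring rule can be illustrated with a small Python sketch. This is only an illustration of the behaviour described above, not the plugin's code: `dissect_anchored` is a hypothetical helper that walks the delimiters left to right, and it ignores modifiers such as `+` and `->`.

```python
import re

def dissect_anchored(pattern, text):
    """Dissect `text`, requiring any leading literal to sit at offset 0."""
    # re.split with a capture group yields: lit0, name0, lit1, name1, ..., litN
    tokens = re.split(r"%\{([^}]*)\}", pattern)
    literals, names = tokens[0::2], tokens[1::2]

    # Anchoring: the leading literal must start the value, not merely occur in it.
    if not text.startswith(literals[0]):
        return None
    pos = len(literals[0])

    fields = {}
    for name, delimiter in zip(names, literals[1:]):
        if delimiter:
            end = text.find(delimiter, pos)
            if end == -1:
                return None  # an expected delimiter is missing
        else:
            end = len(text)  # no delimiter follows: the field takes the rest
        if name:  # an empty name is a skip field: consumed but not stored
            fields[name] = text[pos:end]
        pos = end + len(delimiter)
    return fields

print(dissect_anchored("---BEGIN---%{field1} %{field2}", "---BEGIN---foo bar"))
print(dissect_anchored("---BEGIN---%{field1} %{field2}", "some preamble ---BEGIN---foo bar"))
print(dissect_anchored("%{}---BEGIN---%{field1} %{field2}", "some preamble ---BEGIN---foo bar"))
```

The first and third calls return both fields, while the second returns None because the literal is not at the start of the value. The leading skip field works because it gives the pattern an empty leading literal, so the search for ---BEGIN--- may begin anywhere.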

johncollaros commented 5 years ago

Hi,

Are there any updates for this enhancement? I am planning to implement dissect for an upcoming project.

Thanks

Chadwiki commented 3 years ago

Could we get an update on this feature request?

karenzone commented 3 years ago

I will raise visibility and try to get an update for you.

kares commented 3 years ago

No updates on the issue in terms of having an actual implementation; this isn't supported in the latest dissect plugin. At this point, we're happy to review PRs if anyone has a take on the feature.