Missing field values cause dissected fields to be out of position

ppf2 commented 7 years ago

Sample log entry:

ASM:"4394740425750718628","","2017-03-01 22:38:15","HTTPS","","alerted"

Dissect:

message => 'ASM:"%{supportID}","%{attackType}","%{date_time}","%{protocol},"%{is_truncated}",%{msg}'

This results in:

      "protocol" => "",
      "attackType" => "2017-03-01 22:38:15",
       "date_time" => "HTTPS",
      "is_truncated" => "",
       "supportID" => "4394740425750718628",

See how the positions of the dissected fields are "off" (shifted): attackType holds the date, date_time holds the protocol, and protocol is empty.

One workaround is to use gsub to replace "" with something like "NOT_SET" before sending the line to the dissect filter.
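As a rough illustration of that workaround, here is a minimal Python sketch of the substitution step. It is not Logstash configuration; in Logstash this would typically be done with the mutate filter's gsub option. The function name fill_empty_quoted_fields is hypothetical, and the sketch assumes that "" only ever appears as an empty quoted value, never inside a real value.

```python
def fill_empty_quoted_fields(line, placeholder="NOT_SET"):
    """Replace every empty quoted value ("") with a quoted placeholder.

    Assumption: two adjacent double quotes never occur inside a real value.
    """
    return line.replace('""', f'"{placeholder}"')


sample = 'ASM:"4394740425750718628","","2017-03-01 22:38:15","HTTPS","","alerted"'
print(fill_empty_quoted_fields(sample))
# ASM:"4394740425750718628","NOT_SET","2017-03-01 22:38:15","HTTPS","NOT_SET","alerted"
```

After this step no two delimiters are adjacent, so the fields stay in position; the placeholder then has to be treated as "no value" downstream.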

Per chat with @guyboertje, dissect collapses runs of repeated delimiters so that it can cope with variable extra spaces (think of values that are right-padded with spaces to visually line up fields). Because it has to handle that padding, the behavior is by design.

Filing an enhancement request to see if we can handle this use case in the future. For now, it would be helpful to document how dissect handles fields that may have no value in some log lines.

gshamov commented 7 years ago

+1 ; It would be great if the separators amalgamation could be switched off (and on) by an option to the dissect filter. There are cases when the separators amalgamation is very useful (spaces) but there are cases when it is not (say, a string of something-separated-values format with missing/empty values for some fields).

guyboertje commented 7 years ago

@ppf2, @jordansissel How about if I added a suffix to indicate that the delimiter following this field should be greedy? I'm thinking ->, meaning that users with space padded text have to opt in. It also means not having to commit the whole dissection to either consume-all or consume-one delimiter.

Example: Data:

2017-06-28 12:12:12                  SHORT: f1,,f3,,,f6

Mapping:

%{date} %{+date->} %{APP}: %{csv1},%{csv2},%{csv3},%{csv4},%{csv5},%{csv6}

What do you think?

guyboertje commented 7 years ago

^ pinging @ppf2 @jordansissel

ppf2 commented 7 years ago

@guyboertje Thx, somehow I missed this update :) So to confirm: with the proposed approach, will the above handle the scenario where lines start with 4 dates and the 3 dates following the first one can sometimes be empty strings (no consistency: sometimes the 2nd is empty, sometimes the 4th, sometimes the 3rd and 4th, sometimes the 2nd and 4th, etc.)? And will the config be something like:

%{date} %{+date->} %{+date->} %{+date->} %{APP}: %{csv1},%{csv2},%{csv3},%{csv4},%{csv5},%{csv6}

guyboertje commented 7 years ago

@ppf2 - exactly the opposite. If you need greedy delimiter consumption you must opt in. For missing fields no suffix is needed. For delimiter padded variable length fields, use suffix to tell Dissect to consume any extra delimiters. As I see it the missing field case occurs far more often than fields padded by multiple delimiters.

ppf2 commented 7 years ago

Ah got it, was reading it wrong. Makes sense to me!

guyboertje commented 7 years ago

@ppf2 In your missing date example will the missing field always have two delimiters?

E.G.1 f1-f2-f3-f4-f5-f6
E.G.2 f1--f3-f4-f5-f6
E.G.3 f1-f2--f4-f5-f6
E.G.4 f1--f3--f5-f6

Then this will 'see' the gaps in fields.

%{fld1}-%{fld2}-%{fld3}-%{fld4}-%{fld5}-%{fld6}

Do you have a preference for the suffix or rather, what character(s) shout out greedy to you?

ppf2 commented 7 years ago

"In your missing date example will the missing field always have two delimiters?"

It can be any number of delimiters.

Don't have a preference for the character to use for the suffix :)

guyboertje commented 7 years ago

@ppf2

When I say two delimiters, I mean the declared delimiters on either side of a field that may or may not be present, i.e.

DISSECTION: %{f1}--%{f2}: %{f3}, %{f4}

DATA: aaa--bbb: ccc, ddd
FIELDS: f1 => "aaa", f2 => "bbb", f3 => "ccc", f4 => "ddd"

DATA: aa1--: cc1, dd1
FIELDS: f1 => "aa1", f2 => "", f3 => "cc1", f4 => "dd1"

Note the concatenated delimiters around the missing field --:.
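The two DATA lines above can be checked outside Logstash with a small Python sketch. It uses a non-greedy regular expression rather than dissect itself, so it only mirrors the expected result, not how the plugin works internally; the name split_record is hypothetical.

```python
import re

# Literal delimiters "--", ": " and ", " taken from the DISSECTION line above.
# Each (.*?) group stops at the first occurrence of the delimiter that follows it.
PATTERN = re.compile(r"(.*?)--(.*?): (.*?), (.*)")


def split_record(line):
    """Return a dict of f1..f4, or None if the line does not fit the layout."""
    match = PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(("f1", "f2", "f3", "f4"), match.groups()))


print(split_record("aaa--bbb: ccc, ddd"))
print(split_record("aa1--: cc1, dd1"))  # f2 comes back as "" and nothing shifts
```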

guyboertje commented 7 years ago

Summary:

The proposed suffix -> is ONLY required when multiple delimiters are to be greedily consumed to reach the next non-missing field. This is needed because some loggers 'pad' a field with spaces to make it easier for humans to read. By default, Dissect seeks to only ONE occurrence of the next delimiter pattern, after which it knows it is on a field start boundary. When the field is missing, the seek to the next delimiter pattern succeeds immediately, so it is already on the field end boundary; for an empty field, start == end.

Not greedy example

DATA: ",,,"
DISSECTION: %{f1},%{f2},%{f3},%{f4}
EVENT: {"f1" => "", "f2" => "", "f3" => "", "f4" => ""}

When the greedy behaviour is needed, the suffix does not appear in the name of the field that is created. Greedy example (it says: consume all the spaces between bar and baz)

Padding after the field
DATA: "foo bar             baz quux"
DISSECTION: %{f1} %{f2->} %{f3} %{f4}
EVENT: {"f1" => "foo", "f2" => "bar", "f3" => "baz", "f4" => "quux"}
-------
Padding before the field
DATA: "bar             baz foo quux"
DISSECTION: %{f1->} %{f2} %{f3} %{f4}
EVENT: {"f1" => "bar", "f2" => "baz", "f3" => "foo", "f4" => "quux"}
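The behaviour summarised above can be modelled with a short Python sketch. This is a toy model written for illustration, not the plugin's implementation: the names parse_mapping and dissect are hypothetical, only plain %{name} and %{name->} fields are handled (no append fields such as %{+date}), and a field with no delimiter after it is assumed to take the rest of the line.

```python
import re

FIELD = re.compile(r"%\{([^}]*)\}")


def parse_mapping(mapping):
    """Split a mapping into its literal prefix and (name, greedy, delimiter) triples."""
    parts = FIELD.split(mapping)  # [literal, name, literal, name, ..., literal]
    fields = []
    for i in range(1, len(parts), 2):
        name = parts[i]
        greedy = name.endswith("->")
        if greedy:
            name = name[:-2]  # the suffix never becomes part of the field name
        fields.append((name, greedy, parts[i + 1]))
    return parts[0], fields


def dissect(mapping, text):
    """Toy dissection: non-greedy by default, greedy only for fields marked with ->."""
    prefix, fields = parse_mapping(mapping)
    if not text.startswith(prefix):
        raise ValueError("text does not start with the mapping's literal prefix")
    pos = len(prefix)
    event = {}
    for name, greedy, delim in fields:
        if delim == "":
            end = len(text)  # no delimiter after the field: take the rest
        else:
            end = text.find(delim, pos)  # seek to ONE occurrence of the delimiter
            if end == -1:
                raise ValueError(f"delimiter {delim!r} not found after field {name!r}")
        event[name] = text[pos:end]  # empty when start == end
        pos = end + len(delim)
        if greedy and delim:
            # Opt-in: swallow any further repeats of the same delimiter (padding).
            while text.startswith(delim, pos):
                pos += len(delim)
    return event


print(dissect("%{f1},%{f2},%{f3},%{f4}", ",,,"))
print(dissect("%{f1} %{f2->} %{f3} %{f4}", "foo bar             baz quux"))
print(dissect("%{f1->} %{f2} %{f3} %{f4}", "bar             baz foo quux"))
```

Under these assumptions the three calls reproduce the EVENT lines above. Dropping the -> from the second mapping leaves f3 empty and pushes the padding into f4, which is why padded input has to opt in.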

jordansissel commented 7 years ago

"How about if I added a suffix to indicate that the delimiter following this field should be greedy?"

@guyboertje +1 to having a greedy/non-greedy feature. I did a similar thing for fex a while ago (an unrelated project) where you can turn on/off greedy for a given match:

{?range,field,...}
    The {?...} notation turns on 'non greedy' field separation. The differences here can be shown best by example, first:

      % echo "1...2.3.4" | fex '.{1:3}'
      1.2.3
      % echo "1...2.3.4" | fex '.{?1:3}'
      1..

    In the first example, fex uses '.' as delimiter and ignores empty fields. In the second example (non greedy), it does not ignore those empty fields.

(We don't have to use ? for the syntax, but just giving a strong thumbs-up on having a non-greedy option)
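For readers without fex installed, the same contrast can be shown with a plain split in Python. This only mimics the two outputs quoted above; it makes no claim about how fex is implemented, and split_fields is a hypothetical helper.

```python
def split_fields(line, delimiter, greedy):
    """Split on delimiter; in greedy mode, drop the empty fields between repeats."""
    parts = line.split(delimiter)
    if greedy:
        parts = [part for part in parts if part]
    return parts


line = "1...2.3.4"
print(".".join(split_fields(line, ".", greedy=True)[:3]))   # 1.2.3
print(".".join(split_fields(line, ".", greedy=False)[:3]))  # 1..
```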

guyboertje commented 7 years ago

@jordansissel Nice. Only my impl is non-greedy by default, with opt-in for fields that have padding after them.

mrhoric commented 7 years ago

Hi, have you solved this problem? I have the same issue now, but I can't find the answer in this issue.

guyboertje commented 7 years ago

@mrhoric This PR is pending - it will fix this issue https://github.com/logstash-plugins/logstash-filter-dissect/pull/34

marcofvera commented 7 years ago

I had the same problem due to a missing field.

guyboertje commented 7 years ago

Fixed with #34, but not yet published to RubyGems. Waiting for #37.

bravelib commented 6 years ago

I had the same problem due to a missing field.

guyboertje commented 6 years ago

@bravelib v1.1.1 has been released. You should update the plugin to see whether it works for you.

logstash-plugins / logstash-filter-dissect

Missing field values cause dissected fields to be out of position #11