Closed ppf2 closed 7 years ago
+1. It would be great if separator amalgamation could be switched off (and on) by an option on the dissect filter. There are cases where amalgamating separators is very useful (spaces), but there are cases where it is not (say, a string in a something-separated-values format with missing/empty values for some fields).
@ppf2, @jordansissel
How about if I added a suffix to indicate that the delimiter following this field should be greedy? I'm thinking ->, meaning that users with space-padded text have to opt in. It also means not having to commit the whole dissection to either consume-all or consume-one delimiter.
Example: Data:
2017-06-28 12:12:12 SHORT: f1,,f3,,,f6
Mapping:
%{date} %{+date->} %{APP}: %{csv1},%{csv2},%{csv3},%{csv4},%{csv5},%{csv6}
What do you think?
^ pinging @ppf2 @jordansissel
@guyboertje Thx, somehow I missed this update :) So to confirm: with the proposed approach, will the above handle the scenario where lines have 4 dates at the beginning, and the 3 dates following the first one can sometimes be empty strings? There is no consistency: sometimes the 2nd is empty, sometimes the 4th, sometimes the 3rd and 4th, sometimes the 2nd and 4th, etc. And will the config be something like:
%{date} %{+date->} %{+date->} %{+date->} %{APP}: %{csv1},%{csv2},%{csv3},%{csv4},%{csv5},%{csv6}
@ppf2 - exactly the opposite. If you need greedy delimiter consumption you must opt in. For missing fields no suffix is needed. For delimiter padded variable length fields, use suffix to tell Dissect to consume any extra delimiters. As I see it the missing field case occurs far more often than fields padded by multiple delimiters.
Ah got it, was reading it wrong. Makes sense to me!
@ppf2 In your missing date example will the missing field always have two delimiters?
E.G.1 f1-f2-f3-f4-f5-f6
E.G.2 f1--f3-f4-f5-f6
E.G.3 f1-f2--f4-f5-f6
E.G.4 f1--f3--f5-f6
Then this will 'see' the gaps in fields.
%{fld1}-%{fld2}-%{fld3}-%{fld4}-%{fld5}-%{fld6}
Do you have a preference for the suffix or rather, what character(s) shout out greedy to you?
In your missing date example will the missing field always have two delimiters?
It can be any number of delimiters.
Don't have a preference for the character to use for the suffix :)
@ppf2
When I say two delimiters, I mean the declared delimiters on either side of a field that may or may not be present, i.e.
DISSECTION: %{f1}--%{f2}: %{f3}, %{f4}
DATA: aaa--bbb: ccc, ddd
FIELDS: f1 => "aaa", f2 => "bbb", f3 => "ccc", f4 => "ddd"
DATA: aa1--: cc1, dd1
FIELDS: f1 => "aa1", f2 => "", f3 => "cc1", f4 => "dd1"
Note the concatenated delimiters --: around the missing field.
Summary:
The proposed suffix -> is ONLY required when multiple delimiters are to be greedily consumed to reach the next non-missing field. This is due to some loggers 'padding' a field with spaces to create a better human reading experience.
This means that Dissect will only seek to ONE occurrence of the next delimiter pattern, and then it knows it's on a field start boundary. When the field is missing, it seeks to the next delimiter pattern and finds it immediately, so it's now on the field end boundary; for an empty field, start == end.
Not greedy example
DATA: ",,,"
DISSECTION: %{f1},%{f2},%{f3},%{f4}
EVENT: {"f1" => "", "f2" => "", "f3" => "", "f4" => ""}
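The seek-to-one-delimiter behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the plugin's actual implementation: the function name `dissect_non_greedy` is hypothetical, field prefixes such as + are not interpreted, and adjacent fields with no literal between them are not supported.

```python
import re

# Matches a %{name} field reference in a dissect-style mapping.
FIELD = re.compile(r"%\{([^}]*)\}")


def dissect_non_greedy(mapping, data):
    """Hypothetical sketch: consume exactly ONE delimiter after each field."""
    names = FIELD.findall(mapping)
    # Splitting on a pattern with one capture group yields
    # [literal, name, literal, name, ..., literal]; keep only the literals.
    delims = FIELD.split(mapping)[0::2]
    if not data.startswith(delims[0]):
        raise ValueError("data does not start with the mapping's leading literal")
    pos = len(delims[0])
    event = {}
    for name, delim in zip(names, delims[1:]):
        if delim == "":
            # No literal follows the field (normally the last one): take the rest.
            end = len(data)
        else:
            # Seek to the next occurrence of the delimiter. For a missing
            # field it is found immediately, so start == end and the value is "".
            end = data.find(delim, pos)
            if end == -1:
                raise ValueError(f"delimiter {delim!r} not found")
        event[name] = data[pos:end]
        pos = end + len(delim)
    return event


print(dissect_non_greedy("%{f1},%{f2},%{f3},%{f4}", ",,,"))
print(dissect_non_greedy("%{f1}--%{f2}: %{f3}, %{f4}", "aa1--: cc1, dd1"))
```

Because each delimiter is consumed exactly once, an empty field simply produces an empty string and the fields after it keep their positions.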
In the case of needing the greedy behaviour, the suffix will not appear in the field name when it creates the field. Greedy example (it's saying, consume all the spaces between bar and baz):
Padding after the field
DATA: "foo bar baz quux"
DISSECTION: %{f1} %{f2->} %{f3} %{f4}
EVENT: {"f1" => "foo", "f2" => "bar", "f3" => "baz", "f4" => "quux"}
-------
Padding before the field
DATA: "bar baz foo quux"
DISSECTION: %{f1->} %{f2} %{f3} %{f4}
EVENT: {"f1" => "bar", "f2" => "baz", "f3" => "foo", "f4" => "quux"}
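To make the opt-in concrete, here is a rough Python sketch of how a -> suffix could switch on greedy delimiter consumption for a single field. It is an assumption-laden illustration rather than the plugin's code: `dissect_with_greedy_suffix` is a hypothetical name, "greedy" is modelled as skipping repeated copies of the whole delimiter, and the padded input (four spaces between bar and baz) is made up for the demonstration.

```python
import re

# Matches a %{name} field reference in a dissect-style mapping.
FIELD = re.compile(r"%\{([^}]*)\}")


def dissect_with_greedy_suffix(mapping, data):
    """Hypothetical sketch: a trailing '->' on a field name makes the
    delimiter that follows that field greedy; all other fields consume
    exactly one delimiter."""
    names = FIELD.findall(mapping)
    delims = FIELD.split(mapping)[0::2]  # literals before, between, after fields
    if not data.startswith(delims[0]):
        raise ValueError("data does not start with the mapping's leading literal")
    pos = len(delims[0])
    event = {}
    for raw_name, delim in zip(names, delims[1:]):
        greedy = raw_name.endswith("->")
        # The suffix is stripped, so it never appears in the event's field name.
        name = raw_name[:-2] if greedy else raw_name
        if delim == "":
            end = len(data)  # no literal follows: take the rest of the data
        else:
            end = data.find(delim, pos)
            if end == -1:
                raise ValueError(f"delimiter {delim!r} not found")
        event[name] = data[pos:end]
        pos = end + len(delim)
        if greedy and delim:
            # Opt-in: skip any further copies of the delimiter (the padding).
            while data.startswith(delim, pos):
                pos += len(delim)
    return event


padded = "foo bar" + " " * 4 + "baz quux"  # assumed padding after "bar"
print(dissect_with_greedy_suffix("%{f1} %{f2->} %{f3} %{f4}", padded))
print(dissect_with_greedy_suffix("%{f1} %{f2} %{f3} %{f4}", padded))
```

In this sketch the second call, without the suffix, treats the extra spaces as empty fields, so f3 comes out empty and the remaining text lands in f4. That is the shifted-fields symptom the suffix is meant to avoid.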
How about if I added a suffix to indicate that the delimiter following this field should be greedy?
@guyboertje +1 to having a greedy/non-greedy feature. I did a similar thing for fex a while ago (an unrelated project) where you can turn on/off greedy for a given match:
{?range,field,...}
The {?...} notation turns on 'non greedy' field separation. The differences here can be shown best by example, first:
% echo "1...2.3.4" | fex '.{1:3}'
1.2.3
% echo "1...2.3.4" | fex '.{?1:3}'
1..
In the first example, fex uses '.' as delimiter and ignores empty fields. In the second example (non greedy), it does not ignore those empty fields.
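The difference between the two fex invocations can be mimicked with a small Python sketch. It is not how fex is implemented; the helper name `select_fields` and its parameters are hypothetical and only model the "ignore empty fields" versus "keep empty fields" distinction.

```python
def select_fields(text, delim, start, end, greedy=True):
    """Hypothetical sketch: return fields start..end (1-based, inclusive),
    re-joined with the delimiter."""
    parts = text.split(delim)
    if greedy:
        # Greedy: a run of delimiters counts as one separator,
        # so the empty fields between them are dropped.
        parts = [p for p in parts if p != ""]
    return delim.join(parts[start - 1:end])


print(select_fields("1...2.3.4", ".", 1, 3, greedy=True))   # 1.2.3
print(select_fields("1...2.3.4", ".", 1, 3, greedy=False))  # 1..
```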
(We don't have to use ? for the syntax, but just giving a strong thumbs-up on having a non-greedy option)
@jordansissel Nice. Only my impl is non-greedy by default, with opt-in for fields that have padding after them.
Hi, have you solved this problem? I have the same issue now, but I cannot find the answer in this issue.
@mrhoric This PR is pending - it will fix this issue https://github.com/logstash-plugins/logstash-filter-dissect/pull/34
I had the same problem due to a field missing..
Fixed with #34, but not yet published to RubyGems. Waiting for #37.
I had the same problem due to a field missing..
@bravelib v1.1.1 has been released. You should update the plugin to see whether it works for you.
Sample log entry:
Dissect:
This results in:
See how the positions of the dissected fields are "off" (shifted).
One workaround is to do a gsub to replace "" with something like "NOT_SET" before sending it to the dissect filter, etc..
Per chat with @guyboertje, dissect will not tolerate variable extra spaces - think where values are right-padded with spaces to visually line up fields. Due to the padding, it's by design.
Filing an enhancement request to see if we can handle this use case in the future. For now, it can be helpful to document the handling of fields that may have no values in some log lines.