vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Syslog Structure Data fields appear in sub-object instead of as root properties #12410

Closed gschier closed 2 years ago

gschier commented 2 years ago

Problem

The structured data parsed by the Syslog source does not end up as root properties; instead it is placed in a sub-object keyed on the Syslog SD-ID. The docs clearly state, in the output section, that structured data properties will appear on the root object:

[image: docs output section]

The docs Example shows the same:

[image: docs example]

Configuration

[sources.in_syslog]
type = "syslog"
address = "127.0.0.1:5144"
mode = "udp"

Version

vector 0.17.3 (x86_64-apple-darwin d72c6e7 2021-10-21)

Example Data

Syslog Message:

<%d>1 2022-04-25T23:21:45.715740Z Gregorys-MacBook-Pro.local 2d4d9490-794a-4e60-814c-5597bd5b7b7d 79978 - [exampleSDID@32473 foo="bar"] test message
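To reproduce, a minimal Python sketch that builds the message above and sends it to the UDP listener from the Configuration section. The `<%d>` placeholder is left unfilled in the report; the default priority of 6 (facility kern = 0, severity info = 6) is an assumption inferred from the resulting event below, and the helper names `build_message` and `send_udp` are hypothetical.

```python
import socket


def build_message(facility: int = 0, severity: int = 6) -> str:
    """Build the RFC 5424 test message; PRI = facility * 8 + severity."""
    pri = facility * 8 + severity  # assumption: kern (0) + info (6) = 6
    return (
        f"<{pri}>1 2022-04-25T23:21:45.715740Z Gregorys-MacBook-Pro.local "
        "2d4d9490-794a-4e60-814c-5597bd5b7b7d 79978 - "
        '[exampleSDID@32473 foo="bar"] test message'
    )


def send_udp(message: str, host: str = "127.0.0.1", port: int = 5144) -> int:
    """Send one datagram to the syslog source; returns the bytes sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(message.encode("utf-8"), (host, port))
```

Calling `send_udp(build_message())` with Vector running the configuration above should emit one event; `build_message(0, 1)` yields the `<1>` variant used later in this thread.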

Resulting event:

{
  "appname": "2d4d9490-794a-4e60-814c-5597bd5b7b7d",
  "exampleSDID@32473": { // These should be root properties according to docs
    "foo": "bar"
  },
  "facility": "kern",
  "host": "Gregorys-MacBook-Pro.local",
  "hostname": "Gregorys-MacBook-Pro.local",
  "message": "test message",
  "procid": 79978,
  "severity": "info",
  "source_ip": "127.0.0.1",
  "source_type": "syslog",
  "timestamp": "2022-04-25T23:21:45.715740Z",
  "version": 1
}

References

This other issue also found some discrepancies with the Syslog Example https://github.com/vectordotdev/vector/issues/9281

StephenWakely commented 2 years ago

I am a bit surprised that this test passes if this is the case.

hhromic commented 2 years ago

The test is fine; I think this is a documentation issue, probably outdated documentation. In my opinion, it is good/desirable that the syslog structured data fields are properly namespaced in the resulting object. We use this feature as well and it is very convenient; otherwise SDs with clashing properties would be all over the place.
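To illustrate the clash concern, here is a small Python sketch (not Vector code) comparing root-level merging with SD-ID namespacing. The second SD-ID, `otherSDID@32473`, and both helper names are hypothetical.

```python
def merge_to_root(structured_data: dict) -> dict:
    """Place every SD param at the root; later params overwrite earlier ones."""
    event = {}
    for params in structured_data.values():
        event.update(params)
    return event


def namespaced(structured_data: dict) -> dict:
    """Keep each SD element's params under its SD-ID."""
    return {sd_id: dict(params) for sd_id, params in structured_data.items()}


# Two SD elements that both carry a "foo" param (second SD-ID is made up).
structured_data = {
    "exampleSDID@32473": {"foo": "bar"},
    "otherSDID@32473": {"foo": "qux"},
}

clashed = merge_to_root(structured_data)  # one "foo" value is lost
kept = namespaced(structured_data)        # both "foo" values survive
```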

jszwedko commented 2 years ago

> I am a bit surprised that this test passes if this is the case.

I think that test is showing the behavior described by this issue, that the fields are namespaced under the "name" of the structured data section. I agree with @hhromic that the namespace is desirable to avoid conflicts. I think we should just update the docs.

hhromic commented 2 years ago

I went to refresh my memory about this in our deployed pipeline, and we are using the parse_syslog() VRL function in a remap, not the syslog source. Apologies! In our setup with VRL, Vector does not parse the syslog SD fields into sub-objects; it emits them as plain root-level fields whose keys are prefixed with the SD-ID. For example:

parsed = parse_syslog!(s'<1>1 2022-04-25T23:21:45.715740Z Gregorys-MacBook-Pro.local 2d4d9490-794a-4e60-814c-5597bd5b7b7d 79978 - [exampleSDID@32473 foo="bar"] test message')
# { "appname": "2d4d9490-794a-4e60-814c-5597bd5b7b7d", "exampleSDID@32473.foo": "bar", "facility": "kern", "hostname": "Gregorys-MacBook-Pro.local", "message": "test message", "procid": 79978, "severity": "alert", "timestamp": t'2022-04-25T23:21:45.715740Z', "version": 1 }

The parsed SD field is {"exampleSDID@32473.foo": "bar"} which is NOT a sub-object, just a plain field with string-type key exampleSDID@32473.foo and value bar.

$ parsed.exampleSDID@32473.foo
null

$ parsed."exampleSDID@32473.foo"
"bar"

BUT! The syslog source indeed is parsing the SD fields as sub-objects, so there is an inconsistency there:

{"appname":"2d4d9490-794a-4e60-814c-5597bd5b7b7d","exampleSDID@32473":{"foo":"bar"},"facility":"kern","host":"Gregorys-MacBook-Pro.local","hostname":"Gregorys-MacBook-Pro.local","message":"test message","procid":79978,"severity":"alert","source_ip":"127.0.0.1","source_type":"syslog","timestamp":"2022-04-25T23:21:45.715740Z","version":1}

Looks like the documentation is indeed aligned with the parse_syslog() VRL function behaviour but the syslog source is expanding the .-separated namespaces in the keys into sub-objects?

hhromic commented 2 years ago

Yes, can confirm that the syslog source is "unnesting" keys with periods in them. I just sent this packet with foo.baz as the attribute of the exampleSDID@32473 SD:

<1>1 2022-04-25T23:21:45.715740Z Gregorys-MacBook-Pro.local 2d4d9490-794a-4e60-814c-5597bd5b7b7d 79978 - [exampleSDID@32473 foo.baz="bar"] test message

And got this from the syslog source:

{
  "appname": "2d4d9490-794a-4e60-814c-5597bd5b7b7d",
  "exampleSDID@32473": {
    "foo": {
      "baz": "bar"
    }
  },
  "facility": "kern",
  "host": "Gregorys-MacBook-Pro.local",
  "hostname": "Gregorys-MacBook-Pro.local",
  "message": "test message",
  "procid": 79978,
  "severity": "alert",
  "source_ip": "127.0.0.1",
  "source_type": "syslog",
  "timestamp": "2022-04-25T23:21:45.715740Z",
  "version": 1
}

Definitely not good behaviour :)

The parse_syslog() VRL function does not exhibit this behaviour:

parse_syslog!(s'<1>1 2022-04-25T23:21:45.715740Z Gregorys-MacBook-Pro.local 2d4d9490-794a-4e60-814c-5597bd5b7b7d 79978 - [exampleSDID@32473 foo.baz="bar"] test message')
# { "appname": "2d4d9490-794a-4e60-814c-5597bd5b7b7d", "exampleSDID@32473.foo.baz": "bar", "facility": "kern", "hostname": "Gregorys-MacBook-Pro.local", "message": "test message", "procid": 79978, "severity": "alert", "timestamp": t'2022-04-25T23:21:45.715740Z', "version": 1 }

And yes, now I'm also surprised like @StephenWakely that the referenced test is passing :) Maybe there is some automatic unnesting going on?

jszwedko commented 2 years ago

Aha, interesting, thanks for that investigation @hhromic . In my opinion, parse_syslog should match the syslog source behavior and actually nest the fields (granted that this would be a breaking change).

The test @StephenWakely is referencing is in the syslog source. The .insert() calls seen there take a "path" rather than a literal key, so each "." in the key is interpreted as a path separator and creates a nested object.
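The difference between a path-style insert and a literal-key insert can be sketched in Python as follows. This is only an illustration of the idea, not Vector's implementation; `insert_path` and `insert_literal` are hypothetical names, and the sketch assumes no intermediate key already holds a non-object value.

```python
from typing import Any


def insert_path(event: dict, path: str, value: Any) -> None:
    """Treat each '.' in `path` as a nesting level, creating objects as needed."""
    *parents, leaf = path.split(".")
    node = event
    for segment in parents:
        node = node.setdefault(segment, {})
    node[leaf] = value


def insert_literal(event: dict, key: str, value: Any) -> None:
    """Store `key` verbatim, dots included."""
    event[key] = value


nested: dict = {}
insert_path(nested, "exampleSDID@32473.foo.baz", "bar")

flat: dict = {}
insert_literal(flat, "exampleSDID@32473.foo.baz", "bar")
```

With the same key, `nested` ends up shaped like the syslog source output and `flat` like the parse_syslog() output shown earlier.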

jszwedko commented 2 years ago

Opened https://github.com/vectordotdev/vector/issues/12431 to address the mismatch.

hhromic commented 2 years ago

> In my opinion, parse_syslog should match the syslog source behavior and actually nest the fields (granted that this would be a breaking change).

If aligning the behaviour is the goal, it will be a breaking change one way or another :( Which approach is desirable is a good question. Perhaps "nested" is indeed more convenient/powerful, especially once iteration support lands and these keys can be easily iterated over/manipulated for.. reasons!

In the worst case, in VRL you can always obtain the flattened version with flatten() from the nested object (if needed).
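As a rough Python analogue of that flattening step (a sketch of the idea only, not VRL's flatten() implementation; the helper name and the choice to keep empty objects as-is are assumptions):

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Collapse nested objects into one level, joining keys with '.'."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


event = {
    "exampleSDID@32473": {"foo": {"baz": "bar"}},
    "message": "test message",
}
flattened = flatten(event)
```

Note that after flattening, a param originally named foo.baz can no longer be told apart from a param baz nested under foo.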