vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

parse_syslog best effort detection issue #19258

Open nean-and-i opened 9 months ago

nean-and-i commented 9 months ago


Problem

It appears that parse_syslog uses [ and ] alone as the key for detecting structured data, which is insufficient; it should additionally respect the @ convention (and perhaps other markers), or at least make the detection switchable. Otherwise, regular syslog messages containing [ and ] characters can get misinterpreted as RFC 5424 syslog, which creates a lot of dynamic empty fields.

Per RFC 5424 (https://datatracker.ietf.org/doc/html/rfc5424#section-6.3.2), please use the SD-ID for best-effort detection.
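
For reference, valid RFC 5424 structured data carries either an IANA-registered SD-ID or a private name@enterpriseNumber SD-ID, as in this element adapted from the RFC's own examples:

[exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"]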

Example:

In this example, the syslog part [19347390.704884] is not a valid RFC 5424 identifier and must not be interpreted as RFC 5424!

INPUT:

{
    "message": "<4>Nov 29 13:21:35 TEST kernel: [19347390.704884] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20450930/power-338)"
}

VRL:

. = parse_syslog!(.message)

OUTPUT:

{
    "19347390.704884": {},  #<-- This creates a whole lot of issues when pushing it into a elastic sink
    "appname": "kernel",
    "facility": "kern",
    "hostname": "TEST",
    "indexdate": "2023-11-29",
    "message": "ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20450930/power-338)",
    "severity": "warning",
    "timestamp": "2023-11-29T12:21:35Z"
}

Configuration

. = parse_syslog!(.message)

Version

0.34.0


Example Data

<4>Nov 29 13:21:35 TEST kernel: [19347390.704884] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20450930/power-338)
nean-and-i commented 9 months ago

Please note this issue appears to be present in the syslog source type as well, where it gets even worse in that no workaround can be applied!

Below is a workaround using a combination of the socket source type and the replace function just before parse_syslog.

This ensures that the whole raw, unprocessed syslog message received via the socket reaches a remap transform, so it can be fixed just before the parse_syslog function is applied to it.

This is very similar to, and reminds me a lot of, a bug in a Junos version from Juniper: there, Junos created an invalid RFC 5424 log stream, and falling back to raw messages in rsyslog did the trick to repair/work around that bug. In this case, falling back to the socket type is exactly the same, as the raw syslog message becomes available/fixable in the message field.

[transforms.log_fields]
inputs = [ "log_syslog" ]
type = "remap"
source = '''
    # strip the brackets from tokens like [19347390.704884] so the parser
    # no longer mistakes them for RFC 5424 structured data
    .message = replace!(.message, r'(\[(?P<nonrfc5424>\w+\.\w+)\])', "$nonrfc5424")
    . = parse_syslog!(.message)
'''
# remap drop safety
drop_on_abort = true
drop_on_error = true
reroute_dropped = true
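
For completeness, a minimal sketch of the socket source the transform above reads from might look like the following; the address and mode are illustrative assumptions, not values from the original report:

[sources.log_syslog]
type = "socket"
mode = "udp"
# assumption: listen on the standard syslog port
address = "0.0.0.0:514"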

This issue has a very interesting butterfly effect when using the elasticsearch/opensearch sink, as the following error can be puzzling -> java.lang.IllegalArgumentException: Limit of total fields [1000] has been exceeded

(NOTE: if a search for the Java elasticsearch/opensearch error led you to this article but you are not using Vector -> please debug the keys being sent to elasticsearch/opensearch and ensure no dynamic or constantly changing keys, i.e. field names, are being used!)

java.lang.IllegalArgumentException: Limit of total fields [1000] has been exceeded
    at org.opensearch.index.mapper.MappingLookup.checkFieldLimit(MappingLookup.java:193) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.index.mapper.MappingLookup.checkLimits(MappingLookup.java:185) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.index.mapper.DocumentMapper.validate(DocumentMapper.java:329) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.index.mapper.MapperService.internalMerge(MapperService.java:488) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.index.mapper.MapperService.internalMerge(MapperService.java:444) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.index.mapper.MapperService.merge(MapperService.java:416) ~[opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:633) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:466) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:530) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:411) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:123) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:223) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.11.0.jar:2.11.0]
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.0.jar:2.11.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]

This happens because a lot of dynamic fields with empty RFC 5424 objects can be created and sent to elasticsearch/opensearch, especially when the syslog part within [ ] contains timestamp information or any other changing value, e.g. [19347390.704884].
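
As an additional downstream safeguard (a sketch, not part of the original workaround), the parsed event can be scrubbed of empty-object fields in a remap step before the sink, so stray dynamic keys never reach the index mapping:

[transforms.drop_empty_objects]
inputs = [ "log_fields" ]
type = "remap"
source = '''
    # keep every top-level field except empty objects left behind by the parser
    . = filter(object!(.)) -> |_key, value| {
        !(is_object(value) && length(object!(value)) == 0)
    }
'''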

StephenWakely commented 9 months ago

That message is not getting parsed as 5424. It is being treated as a 3164 message, which is much less rigidly defined.

What is the source of that message?

It does make sense to update that parser to be a bit more strict in determining what constitutes structured data. Something like [19347390.704884] is quite clearly not going to be SD...
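
For illustration, the stricter heuristic being discussed could look something like this in VRL (a sketch under assumptions, not the parser's actual logic): accept a bracketed token as structured data only if its SD-ID is one of the IANA-registered names from RFC 5424 or follows the name@enterpriseNumber convention.

sd_id = "19347390.704884"
# the SD-IDs registered by RFC 5424 itself are timeQuality, origin, and meta
looks_like_sd = includes(["timeQuality", "origin", "meta"], sd_id) || match(sd_id, r'^[^=\s\]"]{1,32}@\d+$')
# looks_like_sd is false here, so the token would stay in the message text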

nean-and-i commented 9 months ago

This log comes from a Linux appliance; it looks like kernel ACPI messages, so it's actually nothing out of the ordinary.

<4>Nov 29 13:21:35 TEST kernel: [19347390.704884] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20450930/power-338)

Indeed, parse_syslog as well as the syslog source (which seems to follow the same logic) should be aware of more specifiers, and especially should not create empty or dynamic structured-data objects.

AFAIK, both RFC 5424 and RFC 3164 have an additional specifier that should be used: @. Alternatively, a configuration option to specify a type/RFC manually would be handy?

tomas-sve commented 1 week ago

We also have the same problem with parse_syslog. In addition to the Linux kernel messages already described, here are two example logs from Proxmox VE:

<30>Aug 21 11:16:28 pve01 corosync[1548]:   [KNET  ] rx: host: 2 link: 1 is up

and

<29>Aug 25 17:28:16 pve02 corosync[1539]:   [TOTEM ] Retransmit List: dc4d07f

where parse_syslog() will create the empty fields KNET: {} and TOTEM: {}, respectively.

Fixing it should be pretty easy, at least when just considering RFC 3164/RFC 5424, since only RFC 5424 has structured data and you can detect whether that standard is used by the VERSION field: <PRI>1 <TIMESTAMP> means the log is RFC 5424, while <PRI><TIMESTAMP> means it is RFC 3164.
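
Expressed in VRL, the version check could be as simple as the following sketch (illustrative only; the regex is an assumption, not Vector's implementation):

msg = "<30>Aug 21 11:16:28 pve01 corosync[1548]:   [KNET  ] rx: host: 2 link: 1 is up"
# RFC 5424 places a VERSION number immediately after <PRI>; RFC 3164 goes straight to the timestamp
is_rfc5424 = match(msg, r'^<\d{1,3}>\d+ ')
# is_rfc5424 is false, so no structured data should be extracted from this message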

SD-IDs don't need an @ sign when they are IANA-registered, for example timeQuality.
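
For example, RFC 5424's own examples use a registered SD-ID without an @ sign, along these lines:

[timeQuality tzKnown="1" isSynced="1" syncAccuracy="60000"]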