vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Splunk HEC Sink: indexed fields are not removed from event itself #7472

Open jerome-kleinen-kbc-be opened 3 years ago

jerome-kleinen-kbc-be commented 3 years ago

Vector Version

vector 0.13.1 (v0.13.1 x86_64-apple-darwin 2021-04-29)

Vector Configuration File

[sources.in]
type = "stdin"

[transforms.remap]
type = "remap"
inputs = ["in"]
source = """
. |= object!(parse_json!(.message))
"""

[sinks.out]
type = "splunk_hec"
inputs = ["remap"]
endpoint = "http://localhost:8088" # required
host_key = "host" # optional, no default
indexed_fields = ["foo"] # optional, no default
token = "26840a2f-b918-46c2-bf6c-509cbb0845ea" # required

# Encoding
encoding.codec = "json" # required

# Healthcheck
healthcheck.enabled = true # optional, default

Debug Output

Expected Behavior

The official Splunk HEC output plugin for fluentd removes fields labeled as indexed fields from the event itself (see https://github.com/splunk/fluent-plugin-splunk-hec#when-data_type-is-event), so I take it this is the "official" way to treat indexed fields.

Actual Behavior

The fields marked as indexed fields are not removed from the raw event before sending the log to Splunk.

Example Data

{ "foo": "indexed", "baz": "not_indexed"}

Additional Context

By default, the fluentd plugin also does this for other metadata fields such as host, source, and sourcetype when the _key variant is used; see https://github.com/splunk/fluent-plugin-splunk-hec#keep_keys-boolean-optional

Vector currently supports the host_key parameter, but the field set there is not removed from the raw event either.

References

jszwedko commented 3 years ago

Thanks @jeromekleinen-kbc. I think you are right that these fields should be dropped from the event itself.

You may have already seen this, but encoding.except_fields can be used as a workaround for now.
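
For illustration, a minimal sketch of that workaround applied to the sink from the report. It assumes the sink reads the indexed fields from the event before the encoding rules are applied, which has not been verified here; the only line added to the original configuration is encoding.except_fields.

```toml
[sinks.out]
type = "splunk_hec"
inputs = ["remap"]
endpoint = "http://localhost:8088"
host_key = "host"
indexed_fields = ["foo"]
token = "26840a2f-b918-46c2-bf6c-509cbb0845ea"

# Encoding
encoding.codec = "json"
# Workaround (assumption: applied after indexed fields are extracted):
# drop the indexed field from the encoded event body so it is not sent twice.
encoding.except_fields = ["foo"]
```

With the example data above, the intent is that "foo" is sent only as an indexed field while "baz" stays in the event body. The field named by host_key could be listed in encoding.except_fields in the same way if it should also be dropped from the event.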

alexgavrisco commented 3 years ago

We have the same issue. Right now Splunk shows two values for each indexed field. It doesn't seem to affect querying, though, so it just adds overhead for data ingestion.