vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Update Syslog source to accept non UTF-8 encoding in syslog message #20462

Open Neko-Follower opened 6 months ago

Neko-Follower commented 6 months ago

Problem

Vector drops logs when it encounters a syslog message containing non-UTF-8 characters. Could you add an option to replace non-UTF-8 characters with U+FFFD, or to pass non-UTF-8 text through as-is, as Promtail does?
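To illustrate the two behaviours being contrasted here, the following is a minimal sketch using only the Rust standard library. It is not Vector's code; the function names `decode_strict` and `decode_lossy` are hypothetical and exist only for this example. Strict decoding rejects the whole input on the first invalid byte, while lossy decoding substitutes U+FFFD for each invalid sequence and keeps the rest of the message.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Strict decoding (hypothetical helper): returns an error as soon as the
/// input contains a byte sequence that is not valid UTF-8.
fn decode_strict(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy decoding (hypothetical helper): every invalid sequence is replaced
/// with U+FFFD (the Unicode replacement character); valid text is kept.
/// Borrows the input when it is already valid, allocates only when needed.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

For example, a message ending in `caf\xE9 ready` (a Latin-1 encoded "é") fails under `decode_strict`, whereas `decode_lossy` yields `caf\u{FFFD} ready`.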

Configuration

No response

Version

0.37.1-distroless-static

Debug Output

No response

Example Data

2024-05-08T07:35:43.209847Z DEBUG source{component_kind="source" component_id=rsyslog component_type=syslog}:connection{peer_addr=172.22.0.4:44600}: vector::sources::util::net::tcp: Accepted a new connection. peer_addr=172.22.0.4:44600

2024-05-08T07:35:44.293974Z ERROR source{component_kind="source" component_id=rsyslog component_type=syslog}:connection{peer_addr=172.22.0.4:44594}: vector::internal_events::codecs: Failed framing bytes. error=Unable to decode input as UTF8 error_code="decoder_frame" error_type="parser_failed" stage="processing" internal_log_rate_limit=true

2024-05-08T07:35:44.294029Z ERROR source{component_kind="source" component_id=rsyslog component_type=syslog}:connection{peer_addr=172.22.0.4:44594}: vector::internal_events::codecs: Internal log [Failed framing bytes.] is being suppressed to avoid flooding.

Additional Context

No response

References

No response

jszwedko commented 6 months ago

Agreed, this could be modeled on the existing decoding.codec.json.lossy option, which replaces invalid UTF-8 sequences with the replacement character (U+FFFD).

jszwedko commented 6 months ago

We'd be happy to see a PR for this if someone is motivated! It should be a relatively straightforward change.

osas1111 commented 4 months ago

I have the same problem with the Fluent source.

kevinmingtarja commented 3 months ago

Hi, I'm interested in contributing to this!

jszwedko commented 3 months ago

> Hi, I'm interested in contributing to this!

Great! We'd be happy to review a PR. You can see https://github.com/vectordotdev/vector/pull/17628 as an example of when it was added to the JSON decoder.

kevinmingtarja commented 3 months ago

Hi @jszwedko, just to confirm: it seems the lossy option for syslog was already added in PR #17680.

Is there anything else missing from that PR that could be causing this bug?

I wrote a unit test for /lib/codecs/src/decoding/format/syslog.rs and was able to verify that SyslogDeserializer::default().parse() does replace non-UTF-8 characters with the replacement character.

jszwedko commented 3 months ago

Ah, and so it was. I forgot this issue is about the syslog source rather than the syslog decoder. I think we still need to add the option to the source.
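As a rough sketch of what exposing such an option at the source level could look like, the snippet below gates the UTF-8 conversion of a frame on a `lossy` flag. Everything here is an assumption made for illustration: `SyslogSourceOptions`, `FrameError`, and `frame_to_text` are hypothetical names, not Vector's actual types, and defaulting `lossy` to `false` is chosen only to mirror the failing behaviour reported above. Only the error text is taken from the log output in this issue.

```rust
use std::borrow::Cow;
use std::fmt;

/// Hypothetical source-level settings; not Vector's real configuration type.
#[derive(Debug, Clone, Copy, Default)]
pub struct SyslogSourceOptions {
    /// When true, invalid UTF-8 is replaced with U+FFFD instead of failing.
    /// Defaults to false here (an assumption for this sketch).
    pub lossy: bool,
}

/// Hypothetical error type standing in for the framing failure.
#[derive(Debug, PartialEq, Eq)]
pub struct FrameError;

impl fmt::Display for FrameError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Message text copied from the error shown in the issue's log output.
        f.write_str("Unable to decode input as UTF8")
    }
}

/// Convert one framed syslog message to text, honouring the `lossy` flag.
pub fn frame_to_text(
    frame: &[u8],
    opts: SyslogSourceOptions,
) -> Result<Cow<'_, str>, FrameError> {
    if opts.lossy {
        // Keep the event; invalid sequences become U+FFFD.
        Ok(String::from_utf8_lossy(frame))
    } else {
        // Strict path: reject the frame, which is what drops the log today.
        std::str::from_utf8(frame)
            .map(Cow::Borrowed)
            .map_err(|_| FrameError)
    }
}
```

With `lossy: false`, a frame containing a stray `0xE9` byte returns `Err(FrameError)`; with `lossy: true`, the same frame is returned with that byte replaced, so the event is kept rather than dropped.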