vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

clickhouse sink doesn't support encoding.codec = raw_message #20699

Open · acpeakhour opened this issue 4 weeks ago

acpeakhour commented 4 weeks ago


The clickhouse sink doesn't support encoding.codec = raw_message, which forces JSON to be parsed in a transform and causes high CPU load.

Use Cases

Many of the sinks already support encoding.codec = raw_message - why not clickhouse?
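For illustration, the requested configuration would look something like this (a sketch; the component, endpoint, and table names are placeholders, and raw_message is exactly the codec the sink rejects today):

```toml
# Hypothetical: the configuration this issue is asking for.
[sinks.clickhouse_out]
type = "clickhouse"
inputs = ["kafka_in"]
endpoint = "http://localhost:8123"
table = "logs"
# Requested: forward the message field as-is instead of
# encoding the whole Vector event as JSON.
encoding.codec = "raw_message"
```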

Attempted Solutions

No response

Proposal

No response

References

No response

Version

No response

jszwedko commented 3 weeks ago

Hi @acpeakhour ,

The clickhouse sink only deals with structured data which is why the codec option is not supported. Could you describe your use-case a bit more? How would you expect the raw_message to be sent to Clickhouse?

acpeakhour commented 3 weeks ago

The use case arises when the payload is JSON and is contained within the message field, which is typical for JSON data from external sources. Inserting this structured data directly into Clickhouse is convenient.

Currently, the only solution for this is a remap transform with:

```toml
source = '''
. |= object!(parse_json!(.message))
'''
```

Followed by encoding.except_fields or an additional del(.message) to remove Vector-specific metadata from the event. Clickhouse already parses JSON data during insertion, so supporting raw_message for the Clickhouse sink would make that parse unnecessary on the Vector side.

Without this feature, we're forced to either use Vector's event schema in Clickhouse and parse it there, or have Vector parse the JSON, which is less efficient.
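Put together, the workaround pipeline looks roughly like this (a minimal sketch; the component names, endpoint, and table are placeholders):

```toml
# Parse the JSON payload in Vector, then drop the original
# message field so only the parsed fields reach ClickHouse.
[transforms.parse_message]
type = "remap"
inputs = ["in"]
source = '''
. |= object!(parse_json!(.message))
del(.message)
'''

[sinks.clickhouse_out]
type = "clickhouse"
inputs = ["parse_message"]
endpoint = "http://localhost:8123"
table = "logs"
```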

The HTTP sink supports encoding.codec = raw_message, but it doesn't support templates in the URI, limiting its usefulness. As a result, we currently have to accept high Vector CPU usage when inserting into Clickhouse.
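For reference, the HTTP-sink variant is roughly the following (a sketch with placeholder host, table, and input names); the whole INSERT statement has to be baked into a static URI, which is where the missing URI templating bites:

```toml
[sinks.clickhouse_http]
type = "http"
inputs = ["in"]
# ClickHouse's HTTP interface accepts the INSERT statement as a
# URL-encoded query parameter; each line of the body becomes one row.
uri = "http://localhost:8123/?query=INSERT%20INTO%20logs%20FORMAT%20JSONEachRow"
encoding.codec = "raw_message"
framing.method = "newline_delimited"
```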

Supporting raw_message in the Clickhouse sink would align it with other sinks' capabilities and provide users with more control over their data pipeline, potentially improving performance and reducing complexity.

acpeakhour commented 3 weeks ago

This is the same issue for the Elasticsearch Sink

jszwedko commented 3 weeks ago

Thanks for the additional detail! I think I'm still missing something though. I'm not super familiar with Clickhouse, but it seems to require you to insert structured / formatted data: https://clickhouse.com/docs/en/sql-reference/statements/insert-into. What would the INSERT statement look like with raw text? Maybe LineAsString (https://clickhouse.com/docs/en/sql-reference/formats#lineasstring)? Could you give an example INSERT statement?

acpeakhour commented 3 weeks ago

```sql
INSERT INTO table FORMAT JSONEachRow
```

https://clickhouse.com/docs/en/integrations/data-formats/json
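Concretely, with JSONEachRow every line of the payload is one JSON object, so a message field that is already serialized JSON could be passed through untouched. A made-up example (table and columns are hypothetical):

```sql
INSERT INTO logs FORMAT JSONEachRow
{"timestamp": "2024-01-01T00:00:00Z", "level": "info", "message": "service started"}
{"timestamp": "2024-01-01T00:00:01Z", "level": "warn", "message": "disk usage at 91%"}
```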

jszwedko commented 3 weeks ago

> INSERT INTO table FORMAT JSONEachRow
>
> https://clickhouse.com/docs/en/integrations/data-formats/json

That's what Vector already uses though 🤔

jszwedko commented 3 weeks ago

Ah, I think I see what you are saying. If message is already JSON we could insert it directly rather than requiring it to be parsed in Vector. Agreed, that seems like a reasonable enhancement to this sink.

acpeakhour commented 3 weeks ago

Yes, that is what I am saying. Supporting raw_message in the sink saves the parse_json in the transform when the message content is JSON. I think it is likely the same for the elasticsearch sink as well.

For me at least, all our events are JSON, and using parse_json seems to have been a common workaround for this. I jumped for joy when I saw raw_message supported as a codec, but cried when it wasn't for clickhouse and elastic.

I believe this is a common use case.

jszwedko commented 3 weeks ago

Agreed, this does seem like a potentially common use-case if not using Vector for any event processing (which would typically require parsing).

Mohan777-G commented 3 weeks ago

Hello @acpeakhour, thank you for raising this issue, which helped me end up in this thread. @jszwedko I am trying to set id_key to the combination of topic+offset+partition, but it doesn't work. Below is my configuration; if you have played around with such configs, could you advise what the correct way is?

```yaml
sources:
  kafka-source:
    type: kafka
    bootstrap_servers: my-kafka-service:9093
    group_id: test-vector
    auto_offset_reset: latest
    commit_interval_ms: 500
    decoding:
      codec: json
    librdkafka_options:
      partition.assignment.strategy: "roundrobin"
    topics:

sinks:
  elasticsearch_sink:
    type: elasticsearch
    inputs: ["msg_split"]
    api_version: auto
    compression: none
    endpoints:
```

zu3st commented 1 week ago

I quickly hacked something together, changing the type of the encoding member of the sink's ClickhouseConfig from Transformer to EncodingConfigWithFraming: https://github.com/vectordotdev/vector/commit/8b99f9836d095b3845595ab3c9f0e28aab613657

This change allows (and requires) encoding.codec to be specified by the user. I ran some quick tests, and ClickHouse rows were correctly inserted for both encoding.codec = raw_message and encoding.codec = json.
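Under that patch the codec has to be spelled out explicitly, e.g. (a sketch with placeholder names, showing the two codecs from the quick tests above):

```toml
[sinks.clickhouse_out]
type = "clickhouse"
inputs = ["in"]
endpoint = "http://localhost:8123"
table = "logs"
# Required after the patch: "json" reproduces today's behavior,
# "raw_message" forwards the message field as-is.
encoding.codec = "raw_message"
```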

Since I have no substantial knowledge of either Rust or Vector, the code should be taken with a huge grain of salt, but it may indicate that no major changes are necessary.