Open acpeakhour opened 4 weeks ago
Hi @acpeakhour ,
The clickhouse sink only deals with structured data, which is why the codec option is not supported. Could you describe your use-case a bit more? How would you expect the raw_message to be sent to Clickhouse?
The use case arises when the payload is JSON and is contained within the message field, which is typical for JSON data from external sources. Inserting this structured data directly into Clickhouse is convenient.
Currently, the only solution for this is a remap transform with:

source = '''
. |= object!(parse_json!(.message))
'''

followed by encoder.skip_fields or an additional del(.message) to strip Vector-specific metadata from the event. Clickhouse already parses JSON data during insertion, so supporting raw_message for the Clickhouse sink would make this extra parsing step unnecessary.
Without this feature, we're forced to either use Vector's event schema in Clickhouse and parse the JSON there, or have Vector parse the JSON, which is less efficient.
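For reference, a sketch of the full workaround pipeline in TOML (the source/sink names, endpoint, and table are made-up placeholders, not from this thread):

```toml
# Hypothetical Vector config illustrating the workaround described above.
# "my_source", the endpoint, and the table name are illustrative placeholders.

[transforms.parse_message]
type = "remap"
inputs = ["my_source"]
source = '''
# Replace the event with the parsed contents of .message,
# then drop the now-redundant raw field.
. |= object!(parse_json!(.message))
del(.message)
'''

[sinks.clickhouse_out]
type = "clickhouse"
inputs = ["parse_message"]
endpoint = "http://localhost:8123"
table = "logs"
```

The remap step is exactly the CPU cost the issue is about: every message is decoded in Vector only to be re-encoded as JSON for the insert.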
The HTTP sink supports encoding.codec = raw_message, but it doesn't support templates in the URI, limiting its usefulness. As a result, we currently have to accept high Vector CPU usage when inserting into Clickhouse.
Supporting raw_message in the Clickhouse sink would align it with other sinks' capabilities and provide users with more control over their data pipeline, potentially improving performance and reducing complexity.
This is the same issue for the Elasticsearch Sink
Thanks for the additional detail! I think I'm still missing something though. I'm not super familiar with Clickhouse, but it seems to require you to insert structured / formatted data: https://clickhouse.com/docs/en/sql-reference/statements/insert-into. What would the INSERT statement look like with raw text? Maybe LineAsString (https://clickhouse.com/docs/en/sql-reference/formats#lineasstring)? Could you give an example INSERT statement?
INSERT INTO table FORMAT JSONEachRow
https://clickhouse.com/docs/en/integrations/data-formats/json
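To make the point concrete, here is a small Python sketch (not Vector code, just an illustration with made-up sample events): when each .message field already holds a JSON object, joining the raw strings with newlines already yields a valid JSONEachRow body, so the parse/re-serialize round trip only burns CPU:

```python
import json

# Events as Vector might hold them: the payload is a JSON string in "message".
events = [
    {"message": '{"user": "alice", "status": 200}', "host": "a"},
    {"message": '{"user": "bob", "status": 404}', "host": "b"},
]

# Workaround today: parse each message, then re-serialize for JSONEachRow.
parsed = "\n".join(json.dumps(json.loads(e["message"])) for e in events)

# With raw_message: just concatenate the raw payloads -- no parsing at all.
raw = "\n".join(e["message"] for e in events)

# Both are valid bodies for INSERT INTO table FORMAT JSONEachRow
# and carry the same data:
assert [json.loads(l) for l in parsed.splitlines()] == \
       [json.loads(l) for l in raw.splitlines()]
```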
That's what Vector already uses though 🤔
Ah, I think I see what you are saying. If message is already JSON we could insert it directly rather than requiring it to be parsed in Vector. Agreed, that seems like a reasonable enhancement to this sink.
Yes, that is what I am saying. Supporting raw_message in the sink saves the parse_json call in the transform when the message content is already JSON. I think the same likely applies to the elasticsearch sink as well.
For me at least, all our events are JSON, and using parse_json seems to be a common workaround for this. I jumped for joy when I saw raw_message supported as a codec, but cried when it wasn't for clickhouse and elastic.
I believe this is a common use case.
Agreed, this does seem like a potentially common use-case if not using Vector for any event processing (which would typically require parsing).
Hello @acpeakhour , thank you for raising this issue, which helped me end up in this thread. @jszwedko I am trying to set id_key to the combination of topic+offset+partition but it doesn't work; below is my configuration. If you have played around with such configs, could you advise on the correct way?

sources:
  kafka-source:
    type: kafka
    bootstrap_servers: my-kafka-service:9093
    group_id: test-vector
    auto_offset_reset: latest
    commit_interval_ms: 500
    decoding:
      codec: json
    librdkafka_options:
      partition.assignment.strategy: "roundrobin"
    topics:

sinks:
  elasticsearch_sink:
    type: elasticsearch
    inputs: ["msg_split"]
    api_version: auto
    compression: none
    endpoints:
I quickly hacked something together, changing the sink's ClickhouseConfig member encoding from type Transformer to EncodingConfigWithFraming: https://github.com/vectordotdev/vector/commit/8b99f9836d095b3845595ab3c9f0e28aab613657
This change allows (and requires) encoding.codec to be specified by the user. I ran some quick tests and clickhouse rows were correctly inserted for both encoding.codec = raw_message and encoding.codec = json.
Since I have no substantial knowledge of either Rust or Vector, the code should be taken with a huge grain of salt, but it may indicate that no major changes are necessary.
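With a change along those lines applied, the sink configuration might look something like this (the names, endpoint, and table are illustrative placeholders, and the exact option shape would depend on how a patch actually lands):

```toml
[sinks.clickhouse_out]
type = "clickhouse"
inputs = ["my_source"]
endpoint = "http://localhost:8123"
table = "logs"
# Ship the .message payload as-is; Clickhouse parses it on INSERT
# (FORMAT JSONEachRow), so Vector never has to decode the JSON itself.
encoding.codec = "raw_message"
```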
A note for the community
The clickhouse sink doesn't support encoding.codec = raw_message, causing high CPU load from parsing JSON in a transform.
Use Cases
Many of the sinks already support encoding.codec = raw_message - why not clickhouse?
Attempted Solutions
No response
Proposal
No response
References
No response
Version
No response