Closed srstrickland closed 9 months ago
Thinking of this further... I think there could be a lot of subtlety to this that would be custom to the user's clickhouse deployment (e.g. sharded / cluster? if so then need to create both a local version that's ReplicatedMergeTree on all shards, and a distributed version). So all that customization might be better suited to live outside of vector, where users can control such things based on what they know about their cluster. I will proceed with experimenting with the middleware / proxy, and see if there is functionality that's generic enough to bring to Vector.
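To make the sharded case concrete, here is a minimal sketch of the two statements such a deployment would need: a replicated local table on every shard plus a distributed table in front of it. The helper name `cluster_ddl`, the `_local` suffix, the ZooKeeper path layout, and the `{shard}` / `{replica}` macros are all assumptions about one particular cluster setup, not something Vector provides.

```python
def cluster_ddl(database, table, cluster, columns="log String"):
    """Build DDL for a sharded setup: a replicated local table and a
    Distributed table that fans out to it. All naming here is illustrative."""
    local = f"{table}_local"  # assumed naming convention for the per-shard table
    # {shard} and {replica} are ClickHouse macros that must be defined in the
    # server config; they are kept literal here, not filled in by Python.
    zk_path = "/clickhouse/tables/{shard}/" + f"{database}/{local}"
    create_local = (
        f"CREATE TABLE IF NOT EXISTS {database}.{local} ON CLUSTER {cluster} "
        f"({columns}) "
        f"ENGINE = ReplicatedMergeTree('{zk_path}', '{{replica}}') "
        "ORDER BY tuple()"
    )
    create_distributed = (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} ON CLUSTER {cluster} "
        f"AS {database}.{local} "
        f"ENGINE = Distributed({cluster}, {database}, {local}, rand())"
    )
    # Order matters: the local table must exist before the distributed one.
    return [create_local, create_distributed]


if __name__ == "__main__":
    for statement in cluster_ddl("logs", "app", "my_cluster"):
        print(statement)
```

A single-node deployment would need only one plain `MergeTree` statement, which is the kind of per-deployment difference that is hard to capture in a generic sink option.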
Yeah, I tend to agree. I can see why it'd be useful for Vector to be able to create tables from a template, but it does open up a can of worms around supporting ClickHouse configuration.
Yeah, I spent some time building an HTTP proxy which does table creation & schema management, and it's sufficiently complex (tho configurable for different use cases) that I no longer think this behavior should be inside vector. There are maybe some things that could be done to make things more efficient (e.g. if vector could indicate the column names - and possibly types - via http headers, the proxy wouldn't have to read the payload most of the time), but there's nothing I can think of that could be well abstracted and avoid tight coupling.
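As a rough illustration of the header idea, the sketch below shows how a proxy could decide whether it needs to read the payload at all. Vector does not send such a header today: the header name `X-Vector-Columns` and its `name:Type;name:Type` format are purely hypothetical, as are the function names.

```python
HINT_HEADER = "x-vector-columns"  # hypothetical header; Vector does not emit it


def parse_column_hint(headers):
    """Parse 'name:Type;name:Type' into {name: type}. Returns None when the
    header is absent. A column without a type maps to None."""
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get(HINT_HEADER)
    if not raw:
        return None
    columns = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first ':' only, so types like DateTime64(3) survive.
        name, _, column_type = item.partition(":")
        columns[name.strip()] = column_type.strip() or None
    return columns


def needs_payload_read(headers, known_columns):
    """True when the proxy must inspect the body: either there is no hint,
    or the hint names a column the table does not have yet."""
    hint = parse_column_hint(headers)
    if hint is None:
        return True
    return not set(hint) <= set(known_columns)
```

With something like this, the common case (all hinted columns already known) could be forwarded without parsing the body, and only schema changes would take the slow path.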
Use Cases
We would like to send various types of logs (some application logs, some structured "events") to ClickHouse without having to worry about schema management. I can already feel the ClickHouse gurus cringing at this notion, as there's a ton of power you can wield by controlling every column thoughtfully and intentionally. However, I think it makes a lot of sense for the `JSONAsString` and (still experimental) `JSONAsObject` formats. Through some light experimentation, I found that providing a simple table schema via: `CREATE TABLE mytable (log JSON) ENGINE=MergeTree() ORDER BY tuple()` was sufficient to allow vector to start streaming logs into it, and with `JSONAsObject` format, everything was typed as you would expect (and further protected from mayhem by using VRL's `tag_types_externally` function). I fully realize that to harness the full power of ClickHouse, we ought to create tables with optimized columns, but the mere existence of the `JSONAsObject` format suggests that letting ClickHouse handle the schema management is a reasonable thing to do. So the only thing left is to be sure the table exists. Of course, if we wanted to go all out, vector could also support `JSONEachRow`, infer types based on incoming payloads, create a column-ed table, and optionally auto-migrate the schema (by adding new columns as needed). But as that is significantly more work, I suggest we scope this feature to `JSONAsObject` and `JSONAsString` types for now. Since both `database` and `table` are templated, it might be reasonable to add a hook to auto-create databases as well.
Attempted Solutions
In the short term, I may stand up a proxy in the middle (since vector appears to interact with ClickHouse over HTTP), to intercept 404's, run some predefined create (db and) table template, and retry.
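A minimal sketch of that intercept-and-retry flow, assuming the proxy already has a way to send a query to ClickHouse. The `send` callable, the function name `insert_with_autocreate`, and the `(status, text)` return shape are assumptions made for illustration; a real proxy would wrap an HTTP POST to the ClickHouse HTTP interface here.

```python
from typing import Callable, List, Tuple

Response = Tuple[int, str]                # (HTTP status, response text)
Send = Callable[[str, bytes], Response]   # (query, body) -> Response


def insert_with_autocreate(
    send: Send,
    insert_query: str,
    payload: bytes,
    create_statements: List[str],
    max_retries: int = 1,
) -> Response:
    """Forward an insert; on a 404, run the predefined create statements
    (database first, then table) and retry the insert."""
    status, text = send(insert_query, payload)
    attempts = 0
    while status == 404 and attempts < max_retries:
        for ddl in create_statements:
            # The templates are expected to use IF NOT EXISTS so that
            # concurrent proxies or Vector instances do not race each other.
            ddl_status, ddl_text = send(ddl, b"")
            if ddl_status != 200:
                return ddl_status, ddl_text  # surface the DDL failure as-is
        status, text = send(insert_query, payload)
        attempts += 1
    return status, text
```

Keeping the retry count bounded means a persistent 404 (for example, a create template that targets the wrong table) is returned to Vector instead of looping forever.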
Proposal
Add a config block:
Note that the `table` spec would only be used to generate a `CREATE TABLE` command if the table doesn't exist; it should not ensure that any existing tables conform to the definition.
An implementation detail, but under the covers, `CREATE TABLE IF NOT EXISTS` should be used to deal with race conditions between multiple vector instances/threads.
References
No response
Version
vector 0.36.0 (x86_64-unknown-linux-gnu a5e48bb 2024-02-13 14:43:11.911392615)