Automatic table creation in ClickHouse

srstrickland commented 9 months ago

Use Cases

We would like to send various types of logs (some application logs, some structured "events") to ClickHouse without having to worry about schema management. I can already feel the ClickHouse gurus cringing at this notion, as there's a ton of power you can wield by controlling every column thoughtfully and intentionally. However, I think it makes a lot of sense for the JSONAsString and (still experimental) JSONAsObject formats.

Through some light experimentation, I found that providing a simple table schema via:

CREATE TABLE mytable (log JSON) ENGINE=MergeTree() ORDER BY tuple()

was sufficient to allow Vector to start streaming logs into it, and with the JSONAsObject format, everything was typed as you would expect (and further protected from mayhem by using VRL's tag_types_externally function). I fully realize that to harness the full power of ClickHouse, we ought to create tables with optimized columns, but the mere existence of the JSONAsObject format suggests that letting ClickHouse handle the schema management is a reasonable thing to do. So the only thing left is to be sure the table exists.

Of course, if we wanted to go all out, Vector could also support JSONEachRow, infer types based on incoming payloads, create a table with explicit columns, and optionally auto-migrate the schema (by adding new columns as needed). But as that is significantly more work, I suggest we scope this feature to the JSONAsObject and JSONAsString formats for now. Since both database and table are templated, it might be reasonable to add a hook to auto-create databases as well.

Attempted Solutions

In the short term, I may stand up a proxy in the middle (since Vector appears to interact with ClickHouse over HTTP) to intercept 404s, run a predefined CREATE DATABASE / CREATE TABLE template, and retry the original request.
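The intercept-and-retry flow described above can be sketched roughly as follows. This is only an illustration, not the proxy that was eventually built: the `send` callable stands in for whatever HTTP client talks to ClickHouse, and the names `insert_with_auto_create` and `create_statements` are hypothetical. It assumes, as the post does, that a missing table surfaces as a 404.

```python
from typing import Callable, List, Tuple

# (status_code, body) as returned by whatever actually talks to ClickHouse.
Response = Tuple[int, str]
# Hypothetical transport: takes (query, payload) and returns a Response.
Sender = Callable[[str, str], Response]


def insert_with_auto_create(
    send: Sender,
    insert_query: str,
    payload: str,
    create_statements: List[str],
) -> Response:
    """Forward an insert; on a 404, run the create templates once and retry."""
    status, body = send(insert_query, payload)
    if status != 404:
        # Success or an unrelated error: pass it through untouched.
        return status, body

    for ddl in create_statements:
        ddl_status, ddl_body = send(ddl, "")
        if ddl_status != 200:
            # Surface the DDL failure instead of retrying blindly.
            return ddl_status, ddl_body

    # Retry exactly once so a persistent 404 cannot loop forever.
    return send(insert_query, payload)
```

Retrying only once keeps the proxy from looping if the 404 has some other cause, such as a typo in a templated database name that the create templates do not cover.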

Proposal

Add a config block:

auto_create:
  database:
    enabled: boolean
  table:
    enabled: boolean
    columns:
      string: string
      ...
    engine: string

Note that the table spec would only be used to generate a CREATE TABLE command if the table doesn't exist; it should not ensure that any existing tables conform to the definition.

As an implementation detail, Vector should issue CREATE TABLE IF NOT EXISTS under the covers to deal with race conditions between multiple Vector instances/threads.
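As a rough sketch of how the proposed auto_create block could be turned into statements, the snippet below builds CREATE ... IF NOT EXISTS commands from a dict shaped like the config above. The function names are hypothetical, and it assumes the engine string carries the full engine clause (for example, including ORDER BY), which the proposal does not spell out.

```python
import re
from typing import Dict, List

# Conservative identifier check so names can be interpolated without quoting.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name: str) -> str:
    """Return the name unchanged if it is a plain identifier, else raise."""
    if not _IDENT.match(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def build_create_statements(database: str, table: str, auto_create: Dict) -> List[str]:
    """Build idempotent DDL from a dict shaped like the auto_create block."""
    statements = []
    if auto_create.get("database", {}).get("enabled"):
        statements.append(f"CREATE DATABASE IF NOT EXISTS {_ident(database)}")

    table_cfg = auto_create.get("table", {})
    if table_cfg.get("enabled"):
        # columns maps column name -> ClickHouse type, as in the proposal.
        columns = ", ".join(
            f"{_ident(name)} {col_type}"
            for name, col_type in table_cfg["columns"].items()
        )
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {_ident(database)}.{_ident(table)} "
            f"({columns}) ENGINE = {table_cfg['engine']}"
        )
    return statements


if __name__ == "__main__":
    config = {
        "database": {"enabled": True},
        "table": {
            "enabled": True,
            "columns": {"log": "String"},
            "engine": "MergeTree() ORDER BY tuple()",
        },
    }
    for statement in build_create_statements("logs", "mytable", config):
        print(statement)
```

Because both statements use IF NOT EXISTS, several instances can run them concurrently; whichever arrives second becomes a no-op, and an existing table is never altered to match the definition.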

Version

vector 0.36.0 (x86_64-unknown-linux-gnu a5e48bb 2024-02-13 14:43:11.911392615)

srstrickland commented 9 months ago

Thinking about this further... I think there could be a lot of subtlety to this that would be specific to the user's ClickHouse deployment. For example, if the deployment is a sharded cluster, you need to create both a local table using ReplicatedMergeTree on all shards and a distributed table on top of it. All of that customization might be better suited to live outside of Vector, where users can control such things based on what they know about their cluster. I will proceed with experimenting with the middleware / proxy, and see if there is functionality that's generic enough to bring to Vector.
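To illustrate why the sharded case complicates things, here is a sketch of the pair of statements such a deployment would typically need. Everything deployment-specific is an assumption: the `_local` suffix, the ZooKeeper path layout, the `rand()` sharding key, and the presence of `{shard}` and `{replica}` macros in the server configuration are common conventions, not something the thread specifies.

```python
from typing import List


def build_cluster_statements(
    cluster: str,
    database: str,
    table: str,
    columns_sql: str,
    order_by: str = "tuple()",
) -> List[str]:
    """Build DDL for a replicated local table plus a Distributed table over it."""
    local = f"{table}_local"  # assumed naming convention for the per-shard table
    # {shard} and {replica} are left literal so ClickHouse can expand them
    # from its own macros configuration.
    zk_path = f"/clickhouse/tables/{{shard}}/{database}/{local}"
    return [
        f"CREATE TABLE IF NOT EXISTS {database}.{local} ON CLUSTER {cluster} "
        f"({columns_sql}) "
        f"ENGINE = ReplicatedMergeTree('{zk_path}', '{{replica}}') "
        f"ORDER BY {order_by}",
        f"CREATE TABLE IF NOT EXISTS {database}.{table} ON CLUSTER {cluster} "
        f"AS {database}.{local} "
        f"ENGINE = Distributed({cluster}, {database}, {local}, rand())",
    ]


if __name__ == "__main__":
    for statement in build_cluster_statements("my_cluster", "logs", "mytable", "log String"):
        print(statement)
```

Each of these choices (cluster name, replication path, sharding key) depends on knowledge of the cluster, which is the kind of configuration surface the comment argues against pulling into Vector.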

jszwedko commented 9 months ago

Yeah, I tend to agree. I can see why it'd be useful for Vector to be able to create tables, since the table name can be templated, but it does open up a can of worms around supporting ClickHouse configuration.

srstrickland commented 9 months ago

Yeah, I spent some time building an HTTP proxy which does table creation & schema management, and it's sufficiently complex (though configurable for different use cases) that I no longer think this behavior should be inside Vector. There are maybe some things that could be done to make things more efficient (e.g. if Vector could indicate the column names, and possibly types, via HTTP headers, the proxy wouldn't have to read the payload most of the time), but there's nothing I can think of that could be well abstracted and avoid tight coupling.
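The header idea could look something like the sketch below. Vector does not send such a header today: the name `X-Vector-Columns` and its `name:Type` comma-separated format are invented purely for illustration, as is the function name. The proxy would compare the declared columns against the schema it already knows and only fall back to reading the payload when something is new or mismatched.

```python
from typing import Dict, Mapping


def needs_payload_inspection(
    headers: Mapping[str, str],
    known_columns: Dict[str, str],
    header_name: str = "X-Vector-Columns",  # hypothetical header, not a Vector feature
) -> bool:
    """Return True when the proxy must read the body to manage the schema."""
    raw = headers.get(header_name)
    if raw is None:
        # No hint from the sender: the proxy has to look at the payload.
        return True

    for item in raw.split(","):
        name, sep, col_type = item.strip().partition(":")
        if not name:
            return True  # malformed hint, fall back to inspecting the payload
        if name not in known_columns:
            return True  # new column, schema work is needed
        if sep and known_columns[name] != col_type:
            return True  # declared type disagrees with the known schema
    return False


if __name__ == "__main__":
    known = {"log": "String", "ts": "DateTime"}
    print(needs_payload_inspection({"X-Vector-Columns": "log:String, ts"}, known))
    print(needs_payload_inspection({"X-Vector-Columns": "log:String,level:String"}, known))
```

Even in this small form, the sender and proxy have to agree on a header name and a type vocabulary, which is the tight coupling the comment is wary of.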
