jitsucom / jitsu

Jitsu is an open-source Segment alternative. Fully-scriptable data ingestion engine for modern data teams. Set-up a real-time data pipeline in minutes, not days
https://jitsu.com
MIT License
4.07k stars 287 forks source link

The clickhouse schema could be more efficient #1024

Closed acmeguy closed 1 year ago

acmeguy commented 1 year ago

Hi,

The events schema that Jitsu created is very ineffective both in terms of storage space and query speed. I would like to contribute on that front.

Who is the person to collaborate/sync with in that regard?

Regard,

absorbb commented 1 year ago

@acmeguy It will be great if you share some details in this ticket.

acmeguy commented 1 year ago

@absorbb

Here are some of the things to consider:

acmeguy commented 1 year ago

Other considerations (my opinions):

acmeguy commented 1 year ago

@absorbb - can I assist with this?

vklimontovich commented 1 year ago

Hi Stefan,

most of you suggestions make sense, and some of them could be implemented easily. However, most of them is not compatible with current Jitsu architecture.

Jitsu's DWH core — Bulker — doesn't know a semantics of data coming in. For example, it doesn't know if event field is present or not — users can send JSON with any structure. So it can't really partition table by (timestamp, type, event).

If you need to apply an optimization, we recommend table ahead of time. In this case Jitsu won't change it. If you can come up with a perfect events table structure, we can add it to documentation.

Here's a feedback on your concrete suggestions:

"Partition By" can be designed better as a composite key: timestamp, type, event (this needs a balance, preferably with the primary key (a single 'message_id' there is not efficient at all))

@absorbb owns this part of the platform, but I believe message_id must be a part of partition key, otherwise we won't be able to use ReplacingMergeTree which is underlying mechanism for deduplication. @absorbb, please correct me if I'm wrong

LowCardinality is not used at at all (massive storage efficiency for repeated values (even in 10ks))

That's something we can't know ahead of time. We could probably have a list of "hints". E.g. if we create a column type for ClickHouse, we should use LowCardinality.

_unmapped properties could be stored in a flattened map (or optionally the experimental JSON)

_unmapped is designed to be a "last resort". If we weren't be able to resolve a type, we just dump data there. It's not intended for actual usage, just for visibility. Theoretically, we can fix this. But I don't see any practical value

Nested structures in Clickhouse

This ones we can use. I believe it should be a separate optional feature

IP addresses are being stored as strings

What whould you suggest we use? IP addresses can come in different format - IP4, IP6, empty string etc. We can use some binary format, however we don't want to sacrifice user experience (e.g. being able to do WHERE IP='8.8.8.8'

Encoding is not being used where it would be beneficial, UUID/GUID are not being recognized and are stored as strings

See above, we don't really know a structure of data ahead of time, but we can implement type hints specifically for Clickhouse

Properties could be stored as maps (to prevent extra wide tables) (currently a performance loss)

Are you sure it's a loss? To my knowledge, ClickHouse is columnar database, so it will read only columns that are needed for the query. So it would be quite the opposite if we do maps: SELECT sum(property_abc) will read the whole map, not just the abc column.

*Properties could be prefixed by "properties". (if this schema does not need to be segment compatible) (less noise on root)**

See notes on nested properties. It make sense to implement as a separate feature

Overall, if you want to help, let's start with following:

acmeguy commented 1 year ago

@vklimontovich Thank you!

Some feedback.

Jitsu puts values into _unmappedProperties if the column types are changed, as it seems to look for "String" column type. (This makes local adjustments impossible)

IP4/IP6/UUID all work with automatic string conversion and quoted ":values" in search.

Clickhouse has an array of Map functions for all sorts of things, including aggregation. The current problem is performance. Clickhouse has plans to introduce "multiple sorted sub-map fields" that would address that without affecting the schema (invisible to the user). (Yes, Clickstore reads the whole map. This makes managing/knowing what properties were stored a lot easier as the user could get all the properties without knowing his own schema for each type)

message_id should be a part of the primary key, the last part. Order and Partition are underutilized now and a lot of query speed can be gained by doing this properly.

Semantic messages do have fields that are a part of the segment.com schema. Knowing them and supporting them properly does not affect the dynamic handling of extended message structures.

I have given this a lot of thought, as I have been persisting semantic events in clickhouse for several years (my old company). I want to believe that this would benefit Jitsu greatly.

You already support multiple schema options (multiple tables, One-Table, Legacy-Table). I think this could be a valuable addition to that.

Let's do this :)

absorbb commented 1 year ago

"Partition By" can be designed better as a composite key: timestamp, type, event (this needs a balance, preferably with the primary key (a single 'message_id' there is not efficient at all))

We partition by timestamp by default, not sure adding type and event will make a difference. Typically page events prevail and partitions for other events type may be underpopulated.

message_id should be a part of the primary key, the last part. Order and Partition are underutilized now and a lot of query speed can be gained by doing this properly.

message_id is already used for PK. Here is engine configuration with default settings:

engine = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/[DB_NAME]/[TABLE_NAME]',
             '{replica}')
        PARTITION BY toYYYYMM(timestamp)
        PRIMARY KEY message_id
        ORDER BY message_id
        SETTINGS index_granularity = 8192;

Do you suggest adding more columns to PK?

Jitsu puts values into _unmappedProperties if the column types are changed, as it seems to look for "String" column type. (This makes local adjustments impossible)

I think i know what is going on here.. So in theory changing column type must not affect the way how data is processed. But bulker currently doesn't know about any clickhouse specific types and treats them as UNKNOWN. And no type conversion to UNKNOWN type is currently allowed so all data (even strings) is going to 'unmapped'. I guess we should lift type conversion restriction for UNKNOWN types and allow all data.

@acmeguy Thank you for your feedback! That is really helpful.

P.S: It is worth to mention that jitsu supports adding type hints to incoming object, e.g. you can use Functions to add type hints to the object in a form of property with name __sql_type_{column_name}, e.g:

export default async function(event, { log, fetch, props: config }) {
  // sql type hint for 'type' column
  event.__sql_type_type = 'LowCardinality(String)'
  // sql type hint for 'event' column
  event.__sql_type_event = 'LowCardinality(String)'
  return
}

type hints affect column type only on creation. existing column types wont be changed.

acmeguy commented 1 year ago

Sounds good, thank you.

These days I'm stuck in admitting our development partner onto our production platform. I would start coding if my mental capacity was not currently at 100% doing that.

As soon as that is done, I will write up a more comprehensive suggestion where I try to make the case for each change.

All the best,

acmeguy commented 1 year ago

Hi @absorbb and @vklimontovich,

I'm starting this work now. Any pointers regarding development setup and contribution?

acmeguy commented 1 year ago

@absorbb and @vklimontovich,

Please point me to some developer material when convenient. I will be looking at the Bulker only and hope to make this effort purely test driven.

acmeguy commented 1 year ago

So @absorbb and @vklimontovich,

I like explicit so much more than implicit.

Would you like input on this or not?

I have time this week to work on this.

vklimontovich commented 1 year ago

Hi @acmeguy! Thank you for you suggestion. At this point, our philosophy is that users should create tables ahead of time if they have large volumes. If you can make a CREATE TABLE statement with the most efficient types for ClickHouse, we will add it to documentation. We don't really want to change any code related to automated columns creation, at least for now

acmeguy commented 1 year ago

@vklimontovich - supporting more types of "user-created structure options" would be the main target.

Not accepting a contribution is perfectly fine.

Not helping in getting it done is just strange.

acmeguy commented 1 year ago

@vklimontovich and @absorbb

For user-create tables, consider supporting:

This way it becomes a real option for users to create a good Clickhouse table structure for these events.