jitsucom / bulker

Service for bulk-loading data to databases with automatic schema management (Redshift, Snowflake, BigQuery, ClickHouse, Postgres, MySQL)
https://github.com/jitsucom/bulker
MIT License

Configuring Jitsu Bulker for Multi-Partition Kafka Topics #17

Open ZiyaadQasem opened 3 weeks ago

ZiyaadQasem commented 3 weeks ago

In a Kubernetes deployment of Jitsu, the Bulker component is responsible for batching events and sending them to a ClickHouse instance. Currently, the Kafka topic that Bulker creates and consumes is configured with only one partition.
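A single-partition topic caps consumer-group parallelism: Kafka assigns each partition to at most one consumer in a group, so extra Bulker replicas sit idle. The sketch below is a simplified model of a round-robin partition assignor (consumer names are hypothetical, not actual Bulker instance names) to illustrate why partition count, not replica count, bounds throughput.

```python
def assign_partitions(partitions, consumers):
    """Simplified model of Kafka's round-robin partition assignment:
    each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Single-partition topic: only one consumer ever receives data.
single = assign_partitions([0], ["bulker-a", "bulker-b", "bulker-c"])
# → {'bulker-a': [0], 'bulker-b': [], 'bulker-c': []}

# Six partitions: work spreads evenly across the three consumers.
multi = assign_partitions([0, 1, 2, 3, 4, 5], ["bulker-a", "bulker-b", "bulker-c"])
# → {'bulker-a': [0, 3], 'bulker-b': [1, 4], 'bulker-c': [2, 5]}
```

With one partition, adding Bulker replicas cannot increase consumption parallelism; only repartitioning the topic can.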

Challenges Encountered:

Questions:

vklimontovich commented 2 weeks ago

We discussed this internally for a while and decided not to implement parallel processing at the moment. For data streams with deduplication enabled, parallel processing can break it: e.g., if two consumers run MERGE statements in parallel, most databases won't guarantee correctness.

For non-deduplicated streams it can give you a performance boost, but most of the use cases we see require deduplication.
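The failure mode above can be shown with a deterministic interleaving sketch. This is not Bulker's actual code, just an illustration of the check-then-insert race that concurrent MERGE-style deduplication can hit when two consumers process the same key without coordination:

```python
# Destination rows as (key, value). Dedup rule: insert only if the key
# is not already present -- a simplified stand-in for a MERGE/upsert.
table = []

def key_exists(rows, key):
    return any(k == key for k, _ in rows)

# Consumer A and consumer B both receive an event with key "evt-1".
# Forced interleaving: both check BEFORE either inserts.
a_sees_duplicate = key_exists(table, "evt-1")  # A checks: not present
b_sees_duplicate = key_exists(table, "evt-1")  # B checks: not present either

if not a_sees_duplicate:
    table.append(("evt-1", "from-consumer-A"))
if not b_sees_duplicate:
    table.append(("evt-1", "from-consumer-B"))  # duplicate row: dedup broken

# The "deduplicated" table now holds the same key twice.
assert len(table) == 2
```

A single consumer serializes the check and insert, which is why Bulker keeps one consumer per deduplicated stream.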

If we ever decide to go forward with this issue, here's what we would do:

Meanwhile, I suggest implementing parallelization by using a different destination for each table.

absorbb commented 2 weeks ago

> Meanwhile, I suggest implementing parallelization by using a different destination for each table.

Actually, topics are created per table so we have that kind of parallelism.

To work around the current limitation, you can duplicate the destination and connection, then rotate writeKeys on the client side or split traffic using a JavaScript function. Deduplication may still be unreliable in this scenario.
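The writeKey-rotation part of that workaround can be sketched as a simple client-side round-robin. The key values below are placeholders, not real Jitsu writeKeys, and the rotation trades away reliable deduplication as noted above:

```python
import itertools

# Hypothetical writeKeys, one per duplicated destination/connection.
WRITE_KEYS = ["key-dest-1", "key-dest-2", "key-dest-3"]
_rotation = itertools.cycle(WRITE_KEYS)

def next_write_key():
    """Round-robin over the duplicated destinations so each event lands
    on a different connection (and hence a different Kafka topic)."""
    return next(_rotation)

picked = [next_write_key() for _ in range(4)]
# → ['key-dest-1', 'key-dest-2', 'key-dest-3', 'key-dest-1']
```

Hash-partitioning by event key instead of round-robin would keep each key on one destination, which preserves per-key ordering even though cross-destination dedup remains unreliable.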