jitsucom / jitsu

Jitsu is an open-source Segment alternative. Fully-scriptable data ingestion engine for modern data teams. Set-up a real-time data pipeline in minutes, not days
https://jitsu.com
MIT License
4.11k stars 292 forks

Apache Kafka as a Destination #174

gervarela closed 10 months ago

gervarela commented 3 years ago

Problem

Apache Kafka is a common source of data for streaming and real-time analytics. In those setups it is also commonly used as a kind of 'data lake' or 'source of truth', where the full event history is recorded before any transformation or ingestion takes place, so that it can later be 'replicated' downstream after a bug fix or when a new feature is added.

We propose supporting the output of EventNative events (in streaming mode) to a Kafka topic (maybe one topic per collection). Downstream, these topics could feed any number of other real-time analytics systems.

All of these systems already support Kafka ingestion.

Solution

Develop support for using Kafka topics as destinations.

A good solution would be to allow one topic per EventNative collection, so that different types of events are written to different topics and the schema of the events in each topic can be known in advance.

The pipeline will be like this: Tracker (JSON) --> EventNative --> Kafka (JSON) --> Others

A nice add-on feature would be to also use Kafka topics to store 'failed events'. In this case you would have one 'good' topic and one 'bad' topic per collection. Among other things, this would allow the 'bad' events topic to be monitored in real time to detect and manage failures.
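To make the per-collection 'good'/'bad' routing concrete, here is a minimal sketch in Python. It is not how EventNative implements destinations. The topic naming scheme (`<collection>.good` / `<collection>.bad`), the `collection` and `event_type` field names, the `unknown` fallback collection, and the `send` callback are all assumptions made for illustration. In a real setup `send` would wrap a Kafka producer client.

```python
import json
from typing import Callable

# Hypothetical naming convention: "<collection>.good" and "<collection>.bad".
def topic_for(collection: str, failed: bool = False) -> str:
    return f"{collection}.{'bad' if failed else 'good'}"

def route_event(raw: str,
                send: Callable[[str, bytes], None],
                fallback: str = "unknown") -> str:
    """Send one tracker payload to its collection's topic and return that topic.

    Events that cannot be parsed or validated go to the collection's 'bad'
    topic (or to the fallback collection if the collection is unknown).
    """
    collection = fallback
    try:
        event = json.loads(raw)
        if not isinstance(event, dict):
            raise ValueError("event must be a JSON object")
        # 'collection' and 'event_type' are assumed field names.
        collection = str(event.get("collection", fallback))
        if "event_type" not in event:
            raise ValueError("missing event_type")
        topic = topic_for(collection)
        payload = json.dumps(event).encode("utf-8")
    except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
        topic = topic_for(collection, failed=True)
        payload = json.dumps({"error": str(err), "raw": raw}).encode("utf-8")
    send(topic, payload)
    return topic
```

Keeping the original payload in the 'bad' record means a consumer of that topic can inspect or replay failed events without access to the tracker.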

azhard4int commented 2 years ago

@absorbb @vklimontovich

As per our discussion on Slack, the number of incoming files for Jitsu increased while using the batch streaming mode with log.rotation_min: 1 at an event rate of 2000/min. The Jitsu uploader processes files sequentially and does not get through all of the events, leaving a backlog.
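The backlog effect can be illustrated with simple arithmetic. The sketch below assumes a single sequential uploader with a fixed throughput. The 2000 events/min arrival rate comes from the comment above, while the 1500 events/min upload rate is a made-up figure used only to show the trend, not a measured value.

```python
def backlog_after(minutes: int, arrival_per_min: int, upload_per_min: int) -> int:
    """Events still waiting after `minutes` with one sequential uploader."""
    backlog = 0
    for _ in range(minutes):
        # Each minute new events arrive and the uploader drains what it can.
        backlog = max(0, backlog + arrival_per_min - upload_per_min)
    return backlog

# Hypothetical upload rate of 1500/min against 2000/min arriving:
print(backlog_after(60, 2000, 1500))  # 30000 events waiting after one hour
```

Whenever the uploader is slower than the arrival rate, the backlog grows linearly and never drains, which matches the behaviour described above.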

As a long-term solution, Kafka could be added as a destination in streaming mode, so that events are sent to Kafka topics and from there to ClickHouse.