Closed: gervarela closed 10 months ago
@absorbb @vklimontovich
As per our discussion on Slack, the number of incoming files for Jitsu increased while using batch streaming mode with log.rotation_min: 1 at an event rate of 2000/min. The Jitsu uploader process is sequential and cannot keep up with all of the events, leaving a growing backlog.
As a long-term solution, Kafka can be added as a destination in streaming mode, so that events are sent to Kafka topics and from there to ClickHouse.
Problem
Apache Kafka is a common source of data for streaming and real-time analytics. In those setups it is also commonly used as a kind of 'data lake' or 'source of truth' where all the history is registered before any transformation and ingestion is done, so that in the future it can be 'replicated' in case of a bug or new feature downstream.
We propose to support the output of EventNative events (in streaming mode) to a Kafka topic (maybe one topic per collection). Downstream, these topics could be consumed by any number of other systems for real-time analytics, all of which already support Kafka ingestion.
Solution
Develop support for using Kafka topics as destinations.
A good solution would allow one topic per EventNative collection, so that different types of events are written to different topics and the schema of the events in each topic can be known in advance.
The pipeline will be like this: Tracker (JSON) --> EventNative --> Kafka (JSON) --> Others
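To make the proposal concrete, a destination entry could look roughly like the following. Note that every key and value here is purely illustrative; this is not an existing Jitsu/EventNative configuration schema, just a sketch of what a Kafka destination might need:

```yaml
# Hypothetical configuration sketch - keys are illustrative only
destinations:
  my_kafka:
    type: kafka              # proposed new destination type
    mode: stream             # streaming mode, as discussed above
    config:
      bootstrap_servers:
        - kafka-1:9092
        - kafka-2:9092
      topic_prefix: eventnative.   # one topic per collection, e.g. eventnative.pageviews
```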
A nice add-on feature would be to also use Kafka topics to store 'failed events'. In this case you would have one 'good' topic and one 'bad' topic per collection. This would allow, among other things, monitoring the 'bad' events topic in real time to detect and manage failures.
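The good/bad routing described above can be sketched as a small pure function. This is a minimal illustration, not Jitsu code: the topic naming scheme (`eventnative.<collection>` with a `.failed` suffix) and the validation rule are assumptions made up for the example.

```python
# Sketch of per-collection topic routing with a parallel 'failed' topic.
# Topic names and validation logic are hypothetical, not EventNative's API.
import json

def topic_for(collection: str, failed: bool = False) -> str:
    """One topic per collection, plus a parallel '.failed' topic."""
    suffix = ".failed" if failed else ""
    return f"eventnative.{collection}{suffix}"

def route_event(raw: str, collection: str) -> tuple[str, str]:
    """Return (topic, payload). Malformed events go to the 'bad' topic."""
    try:
        event = json.loads(raw)
        if not isinstance(event, dict):
            raise ValueError("event must be a JSON object")
        return topic_for(collection), json.dumps(event)
    except (json.JSONDecodeError, ValueError) as err:
        # Wrap the failure so the original payload is preserved for replay.
        wrapper = {"error": str(err), "original": raw}
        return topic_for(collection, failed=True), json.dumps(wrapper)
```

A consumer monitoring `eventnative.<collection>.failed` topics could then alert on failures in real time, while the raw payload kept in the wrapper allows reprocessing once the bug is fixed.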