so3500 / TIL


2024-01-09 #3


Table of Contents

  1. Kafka Documentation

Kafka Documentation

https://kafka.apache.org/documentation/

Introduction

[ what is event streaming? ] Event streaming ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time.

[ what can I use event streaming for? ] stock exchanges and banks; logistics and the automotive industry; factories and wind parks; the retail, hotel, and travel industries; emergencies

[ apache kafka is an event streaming platform ]

  1. publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems
  2. store streams of events durably and reliably for as long as you want
  3. process streams of events as they occur or retrospectively

All this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner. Kafka can be deployed on virtual machines, containers, and on-premises as well as in the cloud.

[ How does Kafka work in a nutshell? ] Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol.

Servers

Clients

[ Main Concepts and Terminology ] keywords: event, producer, consumer, topic, partition, replication

event

Producers are those client applications that publish (write) events to Kafka. Consumers are those that subscribe to (read and process) these events. In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is a key design element to achieve the high scalability Kafka is known for: for example, producers never need to wait for consumers. Kafka provides various guarantees, such as the ability to process events exactly once.
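To make the decoupling concrete, here is a minimal sketch using the official Java clients. The broker address (localhost:9092), topic name ("events"), and group id are assumptions for illustration: the producer publishes and returns without knowing who, if anyone, will read, and a separate consumer subscribes and processes at its own pace.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConsumerSketch {
    public static void main(String[] args) {
        // Producer side: publishes an event and returns, without knowing
        // whether any consumer exists.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("events", "customer-42", "order created"));
        }

        // Consumer side: an entirely independent application that subscribes
        // to the same topic and processes events at its own pace.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "demo-group"); // hypothetical consumer group
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("key=%s value=%s%n", record.key(), record.value());
            }
        }
    }
}
```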

Events are organized and durably stored in topics. Topics in Kafka are always multi-producer and multi-subscriber: a topic can have zero, one, or many producers and consumers. Events in a topic can be read as often as needed - unlike traditional messaging systems, events are not deleted after consumption. Instead, you define how long Kafka should retain your events through a per-topic configuration setting, after which old events will be discarded. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is perfectly fine.
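As a sketch of that per-topic retention setting, the Admin API can set retention.ms when creating a topic. The topic name, partition and replica counts, and the local broker address below are assumptions for illustration, not from the original note.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (Admin admin = Admin.create(props)) {
            // Hypothetical topic: 3 partitions, replication factor 1 (single-broker dev setup).
            // retention.ms controls how long Kafka keeps events before discarding them
            // (here: 7 days in milliseconds).
            NewTopic topic = new NewTopic("orders", 3, (short) 1)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "604800000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```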

Topics are partitioned, meaning a topic is spread over a number of "buckets" located on different Kafka brokers. This distributed placement of your data is very important for scalability because it allows client applications to both read and write data from/to many brokers at the same time. When a new event is published to a topic, it is actually appended to one of the topic's partitions. Events with the same event key (e.g., a customer or vehicle ID) are written to the same partition, and Kafka guarantees that any consumer of a given topic-partition will always read that partition's events in exactly the same order as they were written.
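A small, hedged example of keyed writes with the Java producer (hypothetical topic and key names, local broker assumed): because every record carries the same key, the default partitioner sends them all to the same partition, which is what preserves their relative order for consumers.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedEvents {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All three events share the key "vehicle-17", so the default
            // partitioner hashes them to the same partition; any consumer of
            // that partition reads them in exactly this order.
            for (String value : new String[] {"engine on", "driving", "engine off"}) {
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("vehicle-events", "vehicle-17", value))
                        .get();
                System.out.printf("key=vehicle-17 -> partition %d, offset %d%n",
                        meta.partition(), meta.offset());
            }
        }
    }
}
```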


To make your data fault-tolerant and highly available, every topic can be replicated, even across geo-regions or datacenters, so that there are always multiple brokers that have a copy of the data, just in case things go wrong, you want to do maintenance on the brokers, and so on. A common production setting is a replication factor of 3, i.e., there will always be three copies of your data. This replication is performed at the level of topic-partitions.
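To see where those copies live, the Admin API can describe a topic and list each partition's leader, replicas, and in-sync replicas. A rough sketch, assuming a Kafka Java client 3.1+ (for allTopicNames()), a local broker, and a hypothetical topic named "orders":

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;

public class ShowReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (Admin admin = Admin.create(props)) {
            // Prints which brokers hold a replica of each partition of "orders"
            // and which of those replicas are currently in sync.
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                    .allTopicNames().get().get("orders");
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```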

This primer should be sufficient for an introduction. The Design section of the documentation explains Kafka's various concepts in full detail, if you are interested.

[ Kafka APIs ] In addition to command line tooling for management and administration tasks, Kafka has five core APIs for Java and Scala:

  1. The Admin API to manage and inspect topics, brokers, and other Kafka objects.
  2. The Producer API to publish (write) a stream of events to one or more Kafka topics.
  3. The Consumer API to subscribe to (read) one or more topics and to process the stream of events produced to them.
  4. The Kafka Streams API to implement stream processing applications and microservices. It provides higher-level functions to process event streams, including transformations, stateful operations like aggregations and joins, windowing, processing based on event-time, and more. Input is read from one or more topics in order to generate output to one or more topics, effectively transforming the input streams into output streams.
  5. The Kafka Connect API to build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications so they can integrate with Kafka. For example, a connector to a relational database like PostgreSQL might capture every change to a set of tables. However, in practice, you typically don't need to implement your own connectors because the Kafka community already provides hundreds of ready-to-use connectors.
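As one concrete illustration of these APIs, here is a minimal Kafka Streams sketch that reads from one topic, transforms each event, and writes to another. The application id, topic names, and broker address are assumptions for illustration, not from the original note.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from an input topic, transform each event, and write to an
        // output topic: input streams become output streams.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("text-input");
        input.mapValues(value -> value.toUpperCase()).to("text-output");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```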

Where to go from here