antoinetran closed this issue 6 years ago
@antoinetran Thanks for the interest. Please see the answers below:
Generally speaking, Kafka is not supposed to be used as a blob store; its underlying data structure is not a good fit for that. If all of your messages are large messages, that indicates a blob store is probably a better choice, so the suggestion to use reference-based messaging is generally true.
LiKafkaConsumer keeps track of the "safe offset" for each partition. In your example, if offsets are committed after M1Sn is consumed but before M2Sn is consumed, the safe offset will be M2S1, so the consumer will resume consumption at M2S1 the next time it starts.
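To make the safe-offset idea concrete, here is a minimal sketch (not LiKafkaConsumer's actual implementation; all names are hypothetical): the safe offset of a partition is the offset of the earliest segment belonging to a message that has not yet been fully assembled, so committing it can never skip part of an in-flight large message.

```python
class SafeOffsetTracker:
    """Illustrative safe-offset bookkeeping for one partition."""

    def __init__(self):
        self.incomplete = {}   # message_id -> offset of its first seen segment
        self.next_offset = 0   # offset of the next record to fetch

    def on_segment(self, offset, message_id, is_last):
        # Remember where each in-flight large message started.
        if message_id not in self.incomplete:
            self.incomplete[message_id] = offset
        self.next_offset = offset + 1
        if is_last:
            # Message fully assembled; it no longer holds back the safe offset.
            del self.incomplete[message_id]

    def safe_offset(self):
        # Earliest first-segment offset among incomplete messages,
        # or simply the next offset to fetch if nothing is pending.
        return min(self.incomplete.values(), default=self.next_offset)


# Replaying the scenario M1S1 -- M2S1 -- M1Sn at offsets 0, 1, 2:
tracker = SafeOffsetTracker()
tracker.on_segment(0, "M1", is_last=False)  # M1S1
tracker.on_segment(1, "M2", is_last=False)  # M2S1
tracker.on_segment(2, "M1", is_last=True)   # M1Sn -> M1 complete
print(tracker.safe_offset())                # offset of M2S1, i.e. 1
```

Committing offset 1 here means a restart re-delivers M2 from its very first segment, even though M1 (whose last segment sits at a later offset) was already consumed.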
You can check the following slides for more details. https://www.slideshare.net/JiangjieQin/handle-large-messages-in-apache-kafka-58692297
Hi,
First, I would like to thank you for open-sourcing the project. We have the same needs and we implemented a similar (yet less advanced) solution. I have some questions that might be added to the README:
1. Could you explain why reference-based messaging is recommended in such a case?
2. How do you handle this scenario: there is one partition and two producers writing to the same topic in parallel. The large messages M1 and M2 will have their segments interleaved. Say the start of M1 (called M1S1) comes before M2, and the end of M1 (M1Sn) falls between the start of M2 (M2S1) and its end (M2Sn): M1S1 ---- M2S1 ---- M1Sn ---- M2Sn. Will acknowledging M1 result in acknowledging all offsets between M1S1 and M1Sn, including some of M2's segments? Say the consumer crashes after consuming M1 but before M2. Can Kafka restart at the beginning of M2 instead of the end of M1?