koeninger / kafka-exactly-once


few questions about non transactional DB, end-to-end semantics #7

Closed mazorigal closed 7 years ago

mazorigal commented 7 years ago

hi,

I have few questions:

*** sorry in advance if this is not the correct place to ask

Is it possible to have exactly-once semantics with a DB such as Cassandra? (Maybe by using atomic batches to keep the offsets in another table, but that is also only eventually consistent.)

With the approach of saving offsets in the same transaction as the results, that will indeed guarantee exactly-once reading from Kafka. But when using foreach operations (especially ones with side effects such as DB writes), is it possible to achieve exactly-once end-to-end? If the foreach operation were re-executed, it could cause duplicates in the DB (especially where some count logic is involved and idempotent writes are impossible).

If I have the following pattern:

    for each RDD in the DStream:
        for each Kafka record in the RDD:
            currValue = read value from c with key = record.key
            newValue  = new value based on the record value and currValue
            write newValue back to c where key = record.key

Is there a way to get exactly-once with such a pattern? I thought to read from Kafka with at-least-once semantics, and to store in c, alongside each newValue, the partition/offset of the last processed Kafka record for that key. In case of some failure requiring reprocessing (whether committing the offsets back to Kafka fails or the RDD is re-executed), since the first step of my pattern is to read from c, I can check whether the offset of the Kafka record being processed is larger than the offset stored with the c record.

Are there any possible pitfalls with such a design?
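The offset guard described above can be sketched in Python, with a plain dict standing in for the Cassandra table (the names and record shape here are hypothetical, and in real Cassandra the read-then-write would not be atomic without lightweight transactions):

```python
# Hypothetical sketch of the per-key offset guard, using an in-memory dict
# in place of a Cassandra table.
# table: key -> (value, (partition, last_offset_applied))
table = {}

def apply_record(key, delta, partition, offset):
    """Fold a count update into the stored value only if this record's
    offset is newer than the one already recorded for the key."""
    value, (last_part, last_off) = table.get(key, (0, (partition, -1)))
    if last_part == partition and offset <= last_off:
        return value  # duplicate delivery from a replayed batch: ignore it
    new_value = value + delta
    table[key] = (new_value, (partition, offset))
    return new_value

# at-least-once delivery means the same record may arrive twice
apply_record("k1", 5, 0, 100)
apply_record("k1", 5, 0, 100)  # replay, ignored by the offset guard
apply_record("k1", 3, 0, 101)
```

One caveat with this per-key guard: it only works if all records for a given key come from a single partition (i.e. the topic is partitioned by that key); otherwise you would need to track an offset per (key, partition).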

koeninger commented 7 years ago

The Spark mailing list is probably better for future questions.

For exactly-once output, you need to store offsets and aggregate data in a single transaction, at least on a per-partition basis. I don't think that's very plausible for Cassandra, because you need to be able to abort the transaction if the offsets have already been stored, i.e., a batch is being repeated.
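For contrast, here is a minimal sketch of that single-transaction approach, using Python's stdlib sqlite3 as a stand-in for a transactional store (the table and column names are invented for illustration; this shows the shape of the idea, not the repo's actual code):

```python
import sqlite3

# autocommit mode so BEGIN/COMMIT can be managed explicitly
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE offsets (topic TEXT, part INTEGER, off INTEGER, "
             "PRIMARY KEY (topic, part))")
conn.execute("CREATE TABLE counts (k TEXT PRIMARY KEY, value INTEGER)")

def process_partition(topic, part, to_off, updates):
    """Commit aggregate updates and the new offset in ONE transaction;
    abort if this batch's offset is already stored (a repeated batch)."""
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        row = cur.execute("SELECT off FROM offsets WHERE topic=? AND part=?",
                          (topic, part)).fetchone()
        if row is not None and row[0] >= to_off:
            raise RuntimeError("batch already committed, aborting replay")
        for key, delta in updates:
            existing = cur.execute("SELECT value FROM counts WHERE k=?",
                                   (key,)).fetchone()
            if existing is None:
                cur.execute("INSERT INTO counts VALUES (?, ?)", (key, delta))
            else:
                cur.execute("UPDATE counts SET value=? WHERE k=?",
                            (existing[0] + delta, key))
        cur.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)",
                    (topic, part, to_off))
        cur.execute("COMMIT")
    except Exception:
        conn.rollback()
        raise
```

Because the offset check and the count updates commit or roll back together, a replayed batch either aborts cleanly or never partially applies, which is exactly the property that is hard to get from Cassandra's batches.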

On Dec 16, 2016 04:23, "igor mazor" notifications@github.com wrote: