scylladb / kafka-connect-scylladb

Kafka Connect Scylladb Sink
Apache License 2.0
46 stars 22 forks source link

Interest of scylladb.offset.storage.table.enable option #44

Closed raphaelauv closed 2 years ago

raphaelauv commented 3 years ago

Like the confluent cassandra connector (not open-source) the scylladb connector have the option *.offset.storage.table.enable

cassandra.offset.storage.table.enable=true
scylladb.offset.storage.table.enable=true

But I could not found explanation on why this option exist. Why storing the offsets of the topics inside scylladb is better (and also a good option) than using kafka ( with the consumer_group of the connect ).

By setting false I could follow the lag of the consumer_group with my current kafka monitoring.

Could you explain what are the advantages of cassandra.offset.storage.table.enable=true

Thank you

raphaelauv commented 3 years ago

the documentation say that the option at true give exactly-once delivery semantics -> https://github.com/scylladb/kafka-connect-scylladb/blob/master/documentation/CONFIG.md

`scylladb.offset.storage.table.enable`

If true, Kafka consumer offsets will be stored in ScyllaDB table. If false, connector will skip writing offset information into ScyllaDB (this might imply duplicate writes into ScyllaDB when a task restarts).

but that's contradicts https://github.com/scylladb/kafka-connect-scylladb/issues/40

avelanarius commented 3 years ago

So this documentation is not correct.

I think this is what the documentation meant: Kafka Connect offset storage by default saves the offsets every 60 seconds (offset.flush.interval.ms). Offset storage on Scylla saves the offsets immediately after a successful INSERT. This means that upon a crash/restart, the offsets stored on Kafka Connect would be less fresh than when stored in Scylla. But the documentation is not correct in saying that this makes the connector with Scylla offset storage have exactly-once semantics, as there could be a crash in-between INSERT to Scylla and saving offsets to Scylla. However, using this offset mechanism will greatly reduce the number of duplicate INSERTs.

However, even though the connector is not exactly once (there could be duplicate INSERTs), that doesn't mean that as a result there will be duplicate rows in Scylla. Two inserts (generated by the connector):

> INSERT INTO my_table(pk, ck) VALUES (1, 1) USING TIMESTAMP 1620820115000000;
> INSERT INTO my_table(pk, ck) VALUES (1, 1) USING TIMESTAMP 1620820115000000;

> SELECT * FROM my_table;
 pk | ck
----+----
  1 |  1

will result in only a single row with the correct timestamp. The only exception to this is if you use Scylla CDC, in which case 2 CDC operations in CDC log will be stored.

Sorry for this documentation error - I will fix it soon.

raphaelauv commented 3 years ago

thanks for the precise answer. Should I close the issue or wait for an update of the doc ?

avelanarius commented 3 years ago

Should I close the issue or wait for an update of the doc ?

I pushed a PR fixing this. Once it is merged, this issue will be closed automatically.