Closed: raphaelauv closed this issue 2 years ago
The documentation says that setting this option to true gives exactly-once delivery semantics -> https://github.com/scylladb/kafka-connect-scylladb/blob/master/documentation/CONFIG.md
`scylladb.offset.storage.table.enable`
If true, Kafka consumer offsets will be stored in ScyllaDB table. If false, connector will skip writing offset information into ScyllaDB (this might imply duplicate writes into ScyllaDB when a task restarts).
But that contradicts https://github.com/scylladb/kafka-connect-scylladb/issues/40
So this documentation is not correct.
I think this is what the documentation meant: Kafka Connect offset storage by default saves the offsets every 60 seconds (`offset.flush.interval.ms`). Offset storage on Scylla saves the offsets immediately after a successful `INSERT`. This means that upon a crash/restart, the offsets stored by Kafka Connect would be less fresh than those stored in Scylla. But the documentation is not correct in saying that this gives the connector with Scylla offset storage exactly-once semantics, as there could be a crash in between the `INSERT` to Scylla and saving the offsets to Scylla. However, using this offset mechanism will greatly reduce the number of duplicate `INSERT`s.
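To make the difference concrete, here is a minimal simulation in Python. It is only a sketch of the reasoning above, not the connector's code: `run_with_crash`, `commit_every`, and `crash_after` are hypothetical names, and the crash is assumed to hit right after a write but before the offset for that write is saved.

```python
def run_with_crash(records, commit_every, crash_after):
    """Write records until a simulated crash, then restart from the last saved offset.

    commit_every: save the offset after this many successful writes
                  (1 models "save the offset right after each INSERT").
    crash_after:  number of writes completed before the crash; the crash is
                  assumed to happen after the last write but before its offset is saved.
    Returns every write the sink received, duplicates included.
    """
    writes = []
    committed = 0  # offset to resume from after a restart

    # First run: write until the crash.
    for offset in range(crash_after):
        writes.append(records[offset])
        is_last_before_crash = offset == crash_after - 1
        if (offset + 1) % commit_every == 0 and not is_last_before_crash:
            committed = offset + 1

    # Restart: resume from the last saved offset, replaying anything after it.
    for offset in range(committed, len(records)):
        writes.append(records[offset])

    return writes


records = list(range(10))
periodic = run_with_crash(records, commit_every=5, crash_after=9)   # infrequent flush
immediate = run_with_crash(records, commit_every=1, crash_after=9)  # save after each write

dup_periodic = len(periodic) - len(records)
dup_immediate = len(immediate) - len(records)
print(dup_periodic, dup_immediate)
```

In this toy run the infrequent flush replays four records while the immediate save replays one. Neither reaches zero, which is why the mechanism reduces duplicates but stays at-least-once.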
However, even though the connector is not exactly-once (there could be duplicate `INSERT`s), that doesn't mean there will be duplicate rows in Scylla. Two identical inserts (generated by the connector):
> INSERT INTO my_table(pk, ck) VALUES (1, 1) USING TIMESTAMP 1620820115000000;
> INSERT INTO my_table(pk, ck) VALUES (1, 1) USING TIMESTAMP 1620820115000000;
> SELECT * FROM my_table;
pk | ck
----+----
1 | 1
will result in only a single row with the correct timestamp. The only exception is if you use Scylla CDC, in which case two operations will be stored in the CDC log.
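The same behaviour can be sketched with a toy model in Python. This is an illustration under simplifying assumptions, not Scylla's storage engine: `ToyTable` is a hypothetical class that keeps one entry per primary key with last-write-wins on the timestamp, and appends every operation to a list standing in for the CDC log.

```python
class ToyTable:
    """Toy last-write-wins table keyed by (pk, ck); not Scylla's real implementation."""

    def __init__(self):
        self.rows = {}     # (pk, ck) -> timestamp of the winning write
        self.cdc_log = []  # every write operation, duplicates included

    def insert(self, pk, ck, timestamp):
        key = (pk, ck)
        # A change log records each operation, even a repeated one.
        self.cdc_log.append((key, timestamp))
        # An insert is an upsert: the same key maps to the same row,
        # and the write with the newest timestamp wins.
        if key not in self.rows or timestamp >= self.rows[key]:
            self.rows[key] = timestamp


table = ToyTable()
table.insert(1, 1, 1620820115000000)
table.insert(1, 1, 1620820115000000)  # duplicate produced after a restart

print(len(table.rows), len(table.cdc_log))
```

The table ends up with one row, while the log holds two entries, which mirrors the CDC exception mentioned above.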
Sorry for this documentation error - I will fix it soon.
Thanks for the precise answer. Should I close the issue or wait for an update of the doc?
> Should I close the issue or wait for an update of the doc?
I pushed a PR fixing this. Once it is merged, this issue will be closed automatically.
Like the Confluent Cassandra connector (not open source), the ScyllaDB connector has the option `*.offset.storage.table.enable`.
But I could not find an explanation of why this option exists. Why is storing the topic offsets inside ScyllaDB better (and also a good option) than using Kafka (with the Connect consumer group)?
By setting it to false, I could follow the consumer group lag with my current Kafka monitoring.
Could you explain the advantages of `cassandra.offset.storage.table.enable=true`?
Thank you