scylladb / kafka-connect-scylladb

Kafka Connect Scylladb Sink
Apache License 2.0
42 stars 22 forks source link

Sink does not support update operation #16

Open haaawk opened 4 years ago

haaawk commented 4 years ago

It would be highly useful to support not only insert and delete but also update. That would allow live migrations using Change Data Capture as a source of data.

haaawk commented 4 years ago

It seems that headers in kafka message could be use to distinguish inserts and updates. That would also allow implementation of other operations like partition deletion or range deletion.

It seems that we could also implement TTL per column using headers.

mailmahee commented 4 years ago

Is this related to Counters @haaawk - what is the difference between an insert and update from a scylla POV?

haaawk commented 4 years ago

It's related to all types - not only counters.

Difference between inserts and updates in Scylla is described here -> https://stackoverflow.com/questions/17348558/does-an-update-become-an-implied-insert/60075479#60075479

The problem also is that sink connector does not support unset values. It sets unset columns to null.

For example, when there is a row

pk |  ck |     v1 | v2
 1 |   2 |  test1 | test2

and you perform insert into ks.tb(pk, ck, v1) values (1, 2, test3) using sink connector then the result will be

pk |  ck |     v1  | v2
 1 |   2 |   test3 | <null>

and it should be:

pk |  ck |     v1  | v2
 1 |   2 |   test3 | test2

This means it is impossible to represent update ks.tb set v1 = 'test3' where pk = 1 and ck = 2 as an insert in the sink connector. Even if we wanted to.

mailmahee commented 4 years ago

So it needs to do a read before write and update all the values that are supplied and retain the values that were not supplied? I am assuming this is what will be useful for CDC?

haaawk commented 4 years ago

No. Reading before writing would be wasteful and non-performant. It just needs to support updates and unset values. In CQL operation each column can be in one of three states: have value, be null, or not be set at all. Sink connector supports only the first two.

haaawk commented 4 years ago

Another thing is that even reading before writing won't cut it because it would screw up timestamps of cells. Single CQL operation can have only one timestamp for all affected columns so you can't have unmodified columns with an old timestamp and updated columns with a new timestamp.

mailmahee commented 4 years ago

ok - @haaawk is it possible to give me an example on how to do a partial update to a row ? what would the CQL/prepared statement look like?

haaawk commented 4 years ago

You have examples above. Both insert into ks.tb(pk, ck, v1) values (1, 2, test3) and update ks.tb set v1 = 'test3' where pk = 1 and ck = 2 are setting only part of the row. Assuming the schema is:

create table ks.tb(
pk int,
ck int,
v1 text,
v2 text,
primary key(pk, ck)
)