Open haaawk opened 4 years ago
It seems that headers in kafka message could be use to distinguish inserts and updates. That would also allow implementation of other operations like partition deletion or range deletion.
It seems that we could also implement TTL per column using headers.
Is this related to Counters @haaawk - what is the difference between an insert and update from a scylla POV?
It's related to all types - not only counters.
Difference between inserts and updates in Scylla is described here -> https://stackoverflow.com/questions/17348558/does-an-update-become-an-implied-insert/60075479#60075479
The problem also is that sink connector does not support unset values. It sets unset columns to null.
For example, when there is a row
pk | ck | v1 | v2
1 | 2 | test1 | test2
and you perform insert into ks.tb(pk, ck, v1) values (1, 2, test3)
using sink connector then the result will be
pk | ck | v1 | v2
1 | 2 | test3 | <null>
and it should be:
pk | ck | v1 | v2
1 | 2 | test3 | test2
This means it is impossible to represent update ks.tb set v1 = 'test3' where pk = 1 and ck = 2
as an insert in the sink connector. Even if we wanted to.
So it needs to do a read before write and update all the values that are supplied and retain the values that were not supplied? I am assuming this is what will be useful for CDC?
No. Reading before writing would be wasteful and non-performant. It just needs to support updates and unset values. In CQL operation each column can be in one of three states: have value, be null, or not be set at all. Sink connector supports only the first two.
Another thing is that even reading before writing won't cut it because it would screw up timestamps of cells. Single CQL operation can have only one timestamp for all affected columns so you can't have unmodified columns with an old timestamp and updated columns with a new timestamp.
ok - @haaawk is it possible to give me an example on how to do a partial update to a row ? what would the CQL/prepared statement look like?
You have examples above. Both
insert into ks.tb(pk, ck, v1) values (1, 2, test3)
and
update ks.tb set v1 = 'test3' where pk = 1 and ck = 2
are setting only part of the row. Assuming the schema is:
create table ks.tb(
pk int,
ck int,
v1 text,
v2 text,
primary key(pk, ck)
)
It would be highly useful to support not only insert and delete but also update. That would allow live migrations using Change Data Capture as a source of data.