scylladb / scylla-cdc-source-connector

A Kafka source connector capturing Scylla CDC changes
Apache License 2.0

Support for collections #21

Open Lorak-mmk opened 2 years ago

Lorak-mmk commented 2 years ago

Based on https://github.com/scylladb/scylla-cdc-source-connector/pull/12 and includes the changes from it, so the review should be done on a per-commit basis.

This PR adds simple support for non-frozen collections.

There is a new config option, scylla.collections.mode. Currently the only possible value is "simple"; it selects the format for non-frozen collections (in the future we could add preimage etc.).

The simple-mode format for non-frozen collections is described in README.md (along with the format for frozen collections). In brief: a non-frozen collection is represented as a struct with 2 fields, "mode" and "elements". "mode" marks the type of operation (add elements, remove elements, overwrite the collection), and "elements" holds the elements used in the operation. For a Set, "elements" is simply a Set. For a List, "elements" is a map with a timeuuid key type; when removing elements, the values are null. For a Map, "elements" is simply a Map; when removing elements, the values are null. A UDT is the most complicated: it is represented as a struct in which each field is a Cell, with the same semantics as a column's "Cell": null means no change, non-null with a null value field means removal, and non-null with a non-null value field means a new value.
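To make the UDT "Cell" semantics above concrete, here is a minimal Python sketch of how a consumer might apply such a change to its own copy of a UDT value. The function name apply_udt_change, the plain-dict record shape, and the field names in the usage lines are assumptions for illustration only; they are not the connector's API or its exact wire format.

```python
from typing import Any, Dict, Optional

# Hypothetical in-memory shape: a Cell is either None (no change)
# or a dict with a single "value" key.
Cell = Optional[Dict[str, Any]]


def apply_udt_change(current: Dict[str, Any], change: Dict[str, Cell]) -> Dict[str, Any]:
    """Apply a UDT change record to a consumer-side copy of the UDT."""
    result = dict(current)
    for field, cell in change.items():
        if cell is None:
            # Null cell: this field was not touched by the operation.
            continue
        # Non-null cell: a null "value" means the field was removed,
        # anything else is the field's new value.
        result[field] = cell["value"]
    return result


state = {"street": "Main St", "number": 7, "zip": "00-001"}
change = {"street": None, "number": {"value": None}, "zip": {"value": "00-002"}}
new_state = apply_udt_change(state, change)
```

Here "street" is left alone, "number" becomes null, and "zip" gets a new value.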

I haven't tested it with Avro yet.

Lorak-mmk commented 2 years ago

Now that I think of it, maybe it's redundant to have "REMOVE" mode and removals should be represented as setting to null (as is currently the case for UDT)? Then we would have 2 modes, let's say "UPDATE" and "OVERWRITE", the difference between them would be whether the collection is cleared before operation. @avelanarius

avelanarius commented 2 years ago

> Now that I think of it, maybe it's redundant to have "REMOVE" mode and removals should be represented as setting to null (as is currently the case for UDT)? Then we would have 2 modes, let's say "UPDATE" and "OVERWRITE", the difference between them would be whether the collection is cleared before operation. @avelanarius

I don't see how it would work for sets?

hartmut-co-uk commented 2 years ago

Opinion: Hi, for me it would be great if we could also have the option (configurable?) to just emit FROZEN collections 'as-is' (...always the full latest value). => so without the extra ELEMENTS_VALUE; REMOVED_ELEMENTS_VALUE; MODE_VALUE.

That would make the output record look cleaner and more like what you'd get if you queried Scylla directly.

Lorak-mmk commented 2 years ago

I pushed a new version with a slightly different representation. It had to change because the previous one didn't work well with queries that perform more than one modification on a given collection, e.g.: UPDATE ks.t_list SET v = v - [6, 7], v = v + [4, 5] WHERE pk = 1;

Now there are only 2 modes, OVERWRITE and MODIFY, and the collection struct always has 2 fields: mode and elements. For lists and maps, elements is a map: an element is added or overwritten if its value is not null, and removed otherwise. For sets, elements is a map with boolean values: true means the value was added to the set, false means it was removed. UDTs didn't change.
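As a rough illustration of the OVERWRITE/MODIFY semantics described above, the following Python sketch shows how a consumer could apply such records to its own copy of a collection. The function names, the plain-dict record shape, and the keys such as "k6" (standing in for timeuuid list keys) are hypothetical; the real record format is whatever the README in this PR specifies. List ordering by timeuuid is deliberately left out.

```python
from typing import Any, Dict, Set


def apply_map_delta(current: Dict[Any, Any], change: Dict[str, Any]) -> Dict[Any, Any]:
    """Apply a change to a map (or a list kept as a timeuuid-keyed map)."""
    # OVERWRITE starts from an empty collection, MODIFY from the current one.
    result = {} if change["mode"] == "OVERWRITE" else dict(current)
    for key, value in change["elements"].items():
        if value is None:
            result.pop(key, None)  # null value: element removed
        else:
            result[key] = value    # non-null value: added or overwritten
    return result


def apply_set_delta(current: Set[Any], change: Dict[str, Any]) -> Set[Any]:
    """Apply a change to a set; elements maps each value to a boolean."""
    result = set() if change["mode"] == "OVERWRITE" else set(current)
    for element, added in change["elements"].items():
        if added:
            result.add(element)      # true: value was added
        else:
            result.discard(element)  # false: value was removed
    return result


# A single statement that both removes and appends list elements can be
# expressed as one MODIFY record ("k1", "k6", ... are placeholder keys).
list_state = {"k1": 1, "k6": 6, "k7": 7}
list_change = {"mode": "MODIFY", "elements": {"k6": None, "k7": None, "k8": 4, "k9": 5}}
new_list_state = apply_map_delta(list_state, list_change)

set_state = {1, 2, 3}
set_change = {"mode": "MODIFY", "elements": {2: False, 4: True}}
new_set_state = apply_set_delta(set_state, set_change)
```

Because removals and additions live in the same elements map, a statement with several modifications to one collection no longer needs more than one mode per record.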

I also renamed SIMPLE mode to DELTA, to better reflect what it actually is.

@avelanarius @haaawk

> Opinion: Hi, for me it would be great if we could also have the option (configurable?) to just emit FROZEN collections 'as-is' (...always the full latest value). => so without the extra ELEMENTS_VALUE; REMOVED_ELEMENTS_VALUE; MODE_VALUE.
>
> That would make the output record look cleaner and more like what you'd get if you queried Scylla directly.

Yes, that would of course be better, but it is harder (it requires preimage/postimage usage) and will be added in the future. That's why I added a config option to select the mode for non-frozen collections.