Open hartmut-co-uk opened 3 years ago
I’d be interested in assisting with this if no one else is.
The main difficulty in supporting collection types is supporting non-frozen types. In Scylla there are two types of collections/UDTs: frozen and non-frozen. When you update a frozen collection, its entire contents after the update are stored in the CDC log. On the other hand, you can partially update non-frozen collections (such as appending items to a list). In the CDC log, only the added/removed elements would be saved in such a case.
We (cc: @haaawk) have decided not to overcomplicate the generated Kafka message to accommodate the different operations on non-frozen collections (appending, removing, overwriting), especially since this is not what the Debezium model expects and most Sink Connectors would not support it. However, if we implemented support for postimages (#8, which we plan to do), the state of a non-frozen collection/UDT after an update would be known (with the additional requirement that you enable postimages on your CDC table), which would let us support non-frozen collection types.
(You can read https://docs.scylladb.com/using-scylla/cdc/cdc-advanced-types/ for more info)
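To illustrate the frozen vs. non-frozen difference described above, here is a minimal Python sketch. It is not connector code: `SetDelta`, `apply_frozen` and `apply_non_frozen` are hypothetical names, and the delta layout is a simplified assumption rather than the actual CDC log schema.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SetDelta:
    """Hypothetical, simplified view of one CDC log row for a set column."""
    added: Set[str] = field(default_factory=set)    # elements written by the update
    removed: Set[str] = field(default_factory=set)  # elements deleted by the update
    overwritten: bool = False                       # True if the whole value was replaced


def apply_frozen(previous: Optional[Set[str]], delta: SetDelta) -> Set[str]:
    # A frozen collection is always written as a whole, so the log row
    # already carries the complete new value; the previous state is not needed.
    return set(delta.added)


def apply_non_frozen(previous: Optional[Set[str]], delta: SetDelta) -> Set[str]:
    # A non-frozen collection can be partially updated, so the log row only
    # carries the change; the consumer needs the previous state to rebuild it.
    base = set() if delta.overwritten or previous is None else set(previous)
    return (base - delta.removed) | delta.added
```

With a previous state of `{"a", "b"}`, a delta that adds `"c"` yields `{"c"}` under the frozen interpretation but `{"a", "b", "c"}` under the non-frozen one. This is why a consumer cannot reconstruct a non-frozen value from the delta alone, and why postimages would help.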
In the meantime, I have pushed a (very early) implementation of support for frozen collections: #12. To support post-images, we plan to implement a higher-level abstraction in scylla-cdc-java repo, that combines pre-images, delta and post-image rows and parses delta information of non-frozen collection updates.
(apologies for issue title rename, wrong browser tab -> please ignore)
Hi @avelanarius is there an ETA for post-image support? Alternatively could the support of frozen collections https://github.com/scylladb/scylla-cdc-source-connector/pull/12 be completed and merged any time soon?
To support post-images, we plan to implement a higher-level abstraction in scylla-cdc-java repo, that combines pre-images, delta and post-image rows and parses delta information of non-frozen collection updates.
@avelanarius is this already in the making, are you also looking for contributors? Are there any dependencies on an upcoming Scylla release? (4.6+/5.0)
@avelanarius @hartmut-co-uk can we merge #12 to have support for collection types / UDTs? Is there something blocking us from going ahead with this?
I have done more code changes on my fork last week to accommodate using UDT with Avro, but haven't had time to test them yet. I'll try to make time this week to progress this further.
@avelanarius and @Lorak-mmk are working on support for frozen and non-frozen collections.
hi @hartmut-co-uk @avelanarius can I create a fork out of #12 and use it? Did you do any testing for this or shall I do it?
If I remember correctly, https://github.com/scylladb/scylla-cdc-source-connector/pull/12 contains a performance problem. If you want to use a non-merged version, then https://github.com/scylladb/scylla-cdc-source-connector/pull/21 should be better. It is based on https://github.com/scylladb/scylla-cdc-source-connector/pull/12, supports non-frozen collections too, and doesn't have the performance problem I mentioned.
track
Hi, is there a plan to merge https://github.com/scylladb/scylla-cdc-source-connector/pull/21? We could really use this feature.
+1. This is an important feature
+1
As a consumer of my CDC event stream (Kafka topic), with table cdc enabled and collection types (LIST, SET, MAP) and UDT used, I'd like to receive change data of all columns of the *_cdc_log record, incl. collection type + UDT fields. This would allow me to utilise the change event for stream processing as no data is omitted.
Example use cases: