scylladb / scylla-cdc-source-connector

A Kafka source connector capturing Scylla CDC changes
Apache License 2.0

feature request: support for collection types (LIST, SET, MAP) and UDT #9

Open hartmut-co-uk opened 3 years ago

hartmut-co-uk commented 3 years ago

As a consumer of my CDC event stream (Kafka topic) from a CDC-enabled table that uses collection types (LIST, SET, MAP) and UDTs, I'd like to receive change data for all columns of the *_cdc_log record, including collection-type and UDT fields.

This would allow me to utilise the change events for stream processing, because no data would be omitted.

Example use cases:

brbrown25 commented 3 years ago

I’d be interested in assisting with this if no one else is.

avelanarius commented 3 years ago

The main difficulty in supporting collection types is supporting non-frozen types. In Scylla there are two types of collections/UDTs: frozen and non-frozen. When you update a frozen collection, its entire contents after the update are stored in the CDC log. On the other hand, you can partially update non-frozen collections (such as appending items to a list). In the CDC log, only the added/removed elements would be saved in such a case.
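To make the distinction concrete, here is a minimal Python sketch of the two logging behaviours described above, using a set column as the example. It is a toy model only: `SetDelta` and the helper functions are hypothetical names for illustration and do not reflect the actual layout of the *_cdc_log table or any API of the connector.

```python
from dataclasses import dataclass


@dataclass
class SetDelta:
    """Hypothetical, simplified view of one CDC log row for a set column."""
    added: frozenset = frozenset()
    removed: frozenset = frozenset()
    overwritten: bool = False  # True when the whole value was replaced


def delta_for_frozen_write(new_value):
    # A frozen collection can only be replaced as a whole,
    # so the log row carries the complete contents after the update.
    return SetDelta(added=frozenset(new_value), overwritten=True)


def delta_for_nonfrozen_add(elements):
    # A partial update records only the elements that were added.
    return SetDelta(added=frozenset(elements))


def delta_for_nonfrozen_remove(elements):
    # Likewise, a partial removal records only the removed elements.
    return SetDelta(removed=frozenset(elements))


def state_is_known(delta):
    # Only a full overwrite tells a consumer what the collection now contains.
    return delta.overwritten
```

In this model a consumer that sees only the delta of a non-frozen append cannot tell what the collection contains afterwards, which is the difficulty described above.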

We (cc: @haaawk) have decided not to overcomplicate the generated Kafka message to accommodate the different operations on non-frozen collections (appending, removing, overwriting), especially since this is not what the Debezium model expects and most sink connectors would not support it. However, if we implemented support for postimages (#8, which we plan to do), the state of a non-frozen collection/UDT after an update would be known, at the additional cost of having to enable postimages on your CDC table. That would let us support non-frozen collection types.

(You can read https://docs.scylladb.com/using-scylla/cdc/cdc-advanced-types/ for more info)

In the meantime, I have pushed a (very early) implementation of support for frozen collections: #12. To support post-images, we plan to implement a higher-level abstraction in scylla-cdc-java repo, that combines pre-images, delta and post-image rows and parses delta information of non-frozen collection updates.
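As a rough illustration of what combining pre-image, delta and post-image rows could look like for a single set column, here is a short Python sketch. The function name, its parameters and the resolution order are assumptions made for this example; this is not the scylla-cdc-java API, and it ignores cases such as a column being deleted entirely.

```python
def resolve_post_state(added, removed, overwritten,
                       pre_image=None, post_image=None):
    """Return the set's contents after an update, or None if unknown.

    added / removed: elements named in the delta row (hypothetical model).
    overwritten: True if the delta replaced the whole collection.
    pre_image / post_image: full contents before / after, when available.
    """
    if post_image is not None:
        # A post-image already holds the full state after the update.
        return frozenset(post_image)
    if overwritten:
        # A full overwrite is self-describing: the delta is the new state.
        return frozenset(added)
    if pre_image is not None:
        # Otherwise replay the partial update on top of the pre-image.
        return (frozenset(pre_image) - frozenset(removed)) | frozenset(added)
    # A partial update with no pre-image or post-image: state is unknown.
    return None
```

For example, under these assumptions `resolve_post_state({4}, set(), False)` yields `None`, while passing `pre_image={1, 2}` yields a set containing 1, 2 and 4.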

hartmut-co-uk commented 3 years ago

(apologies for issue title rename, wrong browser tab -> please ignore)

hartmut-co-uk commented 3 years ago

Hi @avelanarius, is there an ETA for post-image support? Alternatively, could the support for frozen collections in https://github.com/scylladb/scylla-cdc-source-connector/pull/12 be completed and merged soon?

hartmut-co-uk commented 3 years ago

To support post-images, we plan to implement a higher-level abstraction in scylla-cdc-java repo, that combines pre-images, delta and post-image rows and parses delta information of non-frozen collection updates.

@avelanarius is this already in the making? Are you also looking for contributors? Are there any dependencies on an upcoming Scylla release (4.6+/5.0)?

jain-vandit commented 2 years ago

@avelanarius @hartmut-co-uk can we merge #12 to get support for collection types / UDTs? Is anything blocking us from going ahead with this?

hartmut-co-uk commented 2 years ago

I have done more code changes on my fork last week to accommodate using UDT with Avro, but haven't had time to test them yet. I'll try to make time this week to progress this further.

haaawk commented 2 years ago

@avelanarius and @Lorak-mmk are working on support for frozen and non-frozen collections.

jain-vandit commented 2 years ago

Hi @hartmut-co-uk @avelanarius, can I create a fork from #12 and use it? Did you do any testing for this, or shall I do it?

Lorak-mmk commented 2 years ago

If I remember correctly, https://github.com/scylladb/scylla-cdc-source-connector/pull/12 contains a performance problem. If you want to use an unmerged version, then https://github.com/scylladb/scylla-cdc-source-connector/pull/21 should be better. It is based on https://github.com/scylladb/scylla-cdc-source-connector/pull/12, supports non-frozen collections too, and doesn't have the performance problem I mentioned.

hansh0801 commented 1 year ago

track

alonomri commented 10 months ago

Hi, is there a plan to merge https://github.com/scylladb/scylla-cdc-source-connector/pull/21? We could really use this feature.

arceushui commented 8 months ago

+1. This is an important feature

BruAPAHE commented 8 months ago

+1