tabular-io / iceberg-kafka-connect

Apache License 2.0

Question: io.tabular.iceberg.connect.transforms.DmsTransform and iceberg.tables.default-partition-by #248

Open rwilliams-r7 opened 1 month ago

rwilliams-r7 commented 1 month ago

I have a question: I am trying to use io.tabular.iceberg.connect.transforms.DmsTransform and iceberg.tables.default-partition-by together.

Based on the documented format I tried iceberg.tables.default-partition-by=hour(_cdc.ts), but this does not seem to work. Looking over the code, it does not appear to be able to dig into the _cdc struct in this case.

Does ts need to be top level?

If so, when using io.tabular.iceberg.connect.transforms.DmsTransform, how have you seen the two used together?

Just to add: is this the same for iceberg.tables.default-id-columns?

Is the fix that all of these fields need to be top level? That would mean moving the identifiers to the top level in the DmsTransform output, something like { ts, id, data { }, metadata { } }, possibly using the CopyValue transform.
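For concreteness, a connector config for that workaround might look like the sketch below. The CopyValue property names (source.field / target.field) are my assumption, and, as noted later in the thread, CopyValue does not currently resolve nested source fields, so this is only how it could look once that is supported:

```properties
# Illustrative sketch only, not a verified working config.
transforms=dms,copyTs
transforms.dms.type=io.tabular.iceberg.connect.transforms.DmsTransform
# Hypothetical: copy the nested CDC timestamp to a top-level field
# so the partition spec can reference it.
transforms.copyTs.type=io.tabular.iceberg.connect.transforms.CopyValue
transforms.copyTs.source.field=_cdc.ts
transforms.copyTs.target.field=ts
# Partition on the now top-level field instead of the nested path.
iceberg.tables.default-partition-by=hour(ts)
```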

gaydba commented 1 month ago

It seems that Iceberg currently doesn't support partitioning on nested fields; there is a feature request for that: https://github.com/apache/iceberg/issues/8175

Also, the CopyValue transform doesn't support nested fields, but that could be fixed in this project. It should use something like https://github.com/tabular-io/iceberg-kafka-connect/blob/690e62e0c40480856df4b9ba1250eecb81851c18/kafka-connect/src/main/java/io/tabular/iceberg/connect/data/Utilities.java#L123C24-L123C46 instead of a raw get on the struct.
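To illustrate the idea, here is a minimal, self-contained sketch of dotted-path extraction. It is not the linked Utilities method; Kafka Connect Structs are modeled as plain Maps here just to keep it runnable, and the helper name is hypothetical:

```java
import java.util.Map;

public class NestedFieldLookup {

    // Hypothetical helper: walk a dotted path like "_cdc.ts" down
    // through nested map values, returning null if any segment is
    // missing or the path descends into a non-map value.
    @SuppressWarnings("unchecked")
    public static Object extract(Object value, String path) {
        Object current = value;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null; // cannot descend further
            }
            current = ((Map<String, Object>) current).get(part);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> value = Map.of(
            "_cdc", Map.of("ts", "2024-01-01T00:00:00Z"),
            "id", 42);
        System.out.println(extract(value, "_cdc.ts")); // nested lookup
        System.out.println(extract(value, "id"));      // top-level lookup
    }
}
```

A raw `get("_cdc.ts")` treats the whole string as one field name and returns null; splitting on the dots and descending level by level is what makes nested references like hour(_cdc.ts) resolvable.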