hpgrahsl / kafka-connect-mongodb

**Unofficial / Community** Kafka Connect MongoDB Sink Connector -> integrated in 2019 into the official MongoDB Kafka Connector: https://www.mongodb.com/kafka-connector
Apache License 2.0

Unable to delete records using "mongodb.delete.on.null.values" property #85

Closed limadelrey closed 5 years ago

limadelrey commented 5 years ago

Hey,

First of all, let me thank you for this brilliant connector. I've been meaning to use it for a while, so as a first use case I decided to build a data pipeline that captures change data from a PostgreSQL database (source) and replicates that content into a MongoDB database (sink).

I've managed to capture change data by using the debezium/debezium-connector-postgresql:0.9.4 connector and some simple configurations. It records every create, read, update and delete operation. This is an example of a read event saved on Kafka:

{ "topic": "ics.i2s.concept", "key": { "id": "c3d92200-07eb-4f23-abef-07bf63eff435" }, "value": { "before": null, "after": { "ics.i2s.concept.Value": { "id": "c3d92200-07eb-4f23-abef-07bf63eff435", "id_trans_label": "1d88ce47-40e8-4505-a5e6-8404429f578b", "id_classification_type": "subject of insurance" } }, "source": { "version": { "string": "0.8.3.Final" }, "name": "ics", "db": "ics", "ts_usec": { "long": 1560350975241000 }, "txId": { "long": 1312 }, "lsn": { "long": 77773136 }, "schema": { "string": "i2s" }, "table": { "string": "concept" }, "snapshot": { "boolean": true }, "last_snapshot_record": { "boolean": false } }, "op": "r", "ts_ms": { "long": 1560350975241 } }, "partition": 0, "offset": 0 }

In order to sink this very same content into MongoDB I'm using the hpgrahsl/kafka-connect-mongodb:1.3.1 connector, and I've applied the following properties:

name=mongodb-sink-test
connector.class=at.grahsl.kafka.connect.mongodb.MongoDbSinkConnector
topics=ics.i2s.concept
tasks.max=1
mongodb.connection.uri=mongodb://root:example@db-functions:27017/ics?authSource=admin
mongodb.collection=test
transforms=unwrap,RenameField,ReplaceField

# Unwraps the CDC event in order to sink only the "after" object inside the event's value
transforms.unwrap.type=io.debezium.transforms.UnwrapFromEnvelope

# Changes the key field name from "id" to "_id"
transforms.RenameField.type=org.apache.kafka.connect.transforms.ReplaceField$Key
transforms.RenameField.renames=id:_id

# Removes the value field "id" in order to avoid duplicate data
transforms.ReplaceField.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.ReplaceField.blacklist=id

# Properties necessary to delete documents in MongoDB when receiving a delete event
mongodb.document.id.strategy=at.grahsl.kafka.connect.mongodb.processor.id.strategy.ProvidedInKeyStrategy
mongodb.delete.on.null.values=true
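
As I understand it from the README, mongodb.delete.on.null.values makes the sink delete the document matching the record key whenever a record with a null value (a tombstone) arrives. A sketch of the kind of record I'd expect to trigger a delete, mirroring the read event above:

{
    "topic": "ics.i2s.concept",
    "key": {
        "id": "c3d92200-07eb-4f23-abef-07bf63eff435"
    },
    "value": null
}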

With these properties I've been able to sink the transformed content into MongoDB. This is an example of a generated MongoDB document:

{
    "_id" : "c3d92200-07eb-4f23-abef-07bf63eff435",
    "id_trans_label" : "1d88ce47-40e8-4505-a5e6-8404429f578b",
    "id_classification_type" : "subject of insurance"
}

However, I haven't been successful with delete events. Read, create and update events work perfectly, but since the goal of this use case is to replicate the very same content from PostgreSQL in MongoDB, I'm writing this issue in the hope that you can help me find my mistake.

Thank you in advance!

hpgrahsl commented 5 years ago

@limadelrey THX for your kind words!

From what I've seen you seem to be missing the fact that the sink connector has "native" support for Debezium CDC events. Please read through this README section again. Given that you configure the CDC handler for Postgres, you don't need to take any special care about the deletion configuration. Likewise there's no need to use the Unwrap SMT or others. It should just work out of the box :)
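
A minimal sketch of that setting, assuming the handler class name that appears later in this thread; the transforms and the id-strategy/delete properties above would then be dropped:

mongodb.change.data.capture.handler=at.grahsl.kafka.connect.mongodb.cdc.debezium.rdbms.postgres.PostgresHandler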

Please let me know in case you succeed!

limadelrey commented 5 years ago

Thank you for the prompt answer @hpgrahsl!

I tried to use it, but I'll give it another try.

From my understanding, using this handler won't transform the event into the format I posted previously. I think it just replicates the event saved by Debezium, with the "before", "after" and several other fields.

Am I wrong?

hpgrahsl commented 5 years ago

> From my understanding, using this handler won't transform the event into the format I posted previously. I think it just replicates the event saved by Debezium, with the "before", "after" and several other fields.

Actually no. By native CDC support I mean that the end result should give you the actual data only instead of replicating the full Debezium event info. That's what it was built for after all. But again please report back if anything doesn't work as expected / documented. Happy to hear from you.

hpgrahsl commented 5 years ago

addendum: the only actual difference is that you would get a different representation for the _id field, as shown below:

{
    "_id": {
        "id": "..."
    },
    ...
}

limadelrey commented 5 years ago

So I went through the documentation and I ended up with the following properties:

name=test
connector.class=at.grahsl.kafka.connect.mongodb.MongoDbSinkConnector
topics=ics.i2s.concept
tasks.max=1
mongodb.connection.uri=mongodb://root:example@db-functions:27017/ics?authSource=admin
mongodb.collection=test

key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081

mongodb.change.data.capture.handler=at.grahsl.kafka.connect.mongodb.cdc.debezium.rdbms.postgres.PostgresHandler

However, I'm getting a NullPointerException:

org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
    at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:586)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:322)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at at.grahsl.kafka.connect.mongodb.MongoDbSinkTask.buildWriteModelCDC(MongoDbSinkTask.java:245)
    at at.grahsl.kafka.connect.mongodb.MongoDbSinkTask.processSinkRecords(MongoDbSinkTask.java:142)
    at at.grahsl.kafka.connect.mongodb.MongoDbSinkTask.lambda$null$0(MongoDbSinkTask.java:118)
    at java.util.ArrayList.forEach(ArrayList.java:1257)
    at at.grahsl.kafka.connect.mongodb.MongoDbSinkTask.lambda$put$1(MongoDbSinkTask.java:117)
    at java.util.HashMap.forEach(HashMap.java:1289)
    at at.grahsl.kafka.connect.mongodb.MongoDbSinkTask.put(MongoDbSinkTask.java:112)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:564)
    ... 10 more

Am I using an erroneous configuration?

hpgrahsl commented 5 years ago

You might be hitting exactly this issue: https://github.com/hpgrahsl/kafka-connect-mongodb/issues/82#issuecomment-496581444. It is already resolved in the latest master but not released yet. However, I've shown a temporary workaround by means of re-configuration in the comment linked above.
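
In short, the workaround boils down to mapping the topic onto an explicitly named collection and registering the CDC handler per collection instead of globally, roughly like this (collection name chosen for illustration; see the linked comment for details):

mongodb.collections=test
mongodb.collection.ics.i2s.concept=test
mongodb.change.data.capture.handler.test=at.grahsl.kafka.connect.mongodb.cdc.debezium.rdbms.RdbmsHandler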

limadelrey commented 5 years ago

Hey again,

I was indeed hitting the same issue that was reported earlier. I ended up with the following properties and it was successful:

name=test-connector
connector.class=at.grahsl.kafka.connect.mongodb.MongoDbSinkConnector
topics=ics.i2s.concept
tasks.max=1
mongodb.connection.uri=mongodb://root:example@db-functions:27017/ics?authSource=admin

value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081

mongodb.collections=test
mongodb.collection.ics.i2s.concept=test
mongodb.change.data.capture.handler.test=at.grahsl.kafka.connect.mongodb.cdc.debezium.rdbms.RdbmsHandler

At this moment it is possible to sink the topic events into MongoDB by using the CDC handler configuration, so let me thank you for your support! However, just like you said, I'm getting documents with the embedded id:

{
    "_id": {
        "id" : "c3d92200-07eb-4f23-abef-07bf63eff435"
    },
    "id_trans_label": "1d88ce47-40e8-4505-a5e6-8404429f578b",
    "id_classification_type": "subject of insurance"
}

Is there a way to flatten the document before sinking in order to match the following structure?

{
    "_id" : "c3d92200-07eb-4f23-abef-07bf63eff435",
    "id_trans_label" : "1d88ce47-40e8-4505-a5e6-8404429f578b",
    "id_classification_type" : "subject of insurance"
}

Thank you once again!

hpgrahsl commented 5 years ago

@limadelrey glad to hear you got it working! As for the flat _id field structure I have to say no, it's not possible. This has a reason and is based on a design decision:

a) The _id field itself is special and used in MongoDB only, i.e. unknown to other databases and thus to CDC sources. If a source database used a primary key column named xyz, you would lose the information about the original column name, which is typically not what you want. That's why it gets wrapped as a sub-document into the _id field.

b) It should work generically enough to also support different cases. Think about compound primary keys in source databases, e.g. a PK spanning 2 columns (ca and cb): these then form a compound _id field by means of a sub-document in MongoDB as well.
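
For illustration, a compound PK spanning columns ca and cb would end up in MongoDB roughly as follows (values are placeholders):

{
    "_id": {
        "ca": "...",
        "cb": "..."
    },
    ...
}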

I hope this clarifies the ideas behind the current behaviour of the CDC mode. If you have suggestions and want to work on improving this, feel free to implement it and send a PR. Do you think we can close this issue then? If so please do - THX.

limadelrey commented 5 years ago

It does make sense. The reality is that my use case differs from the intended behaviour of the CDC mode. Thank you so much for your explanations, I'll close the issue then. Cheers!

hpgrahsl commented 5 years ago

I hope this doesn't mean you won't keep leveraging Kafka Connect and the MongoDB sink connector, since the _id structure not being flat actually doesn't change much. Others are doing it like this as well. And I don't think you'll find an easier or more reliable way to get proper CDC event replication towards MongoDB anywhere else :) THX also to the Debezium source connectors of course!

limadelrey commented 5 years ago

Of course not. I've finished my use case of capturing change data from a PostgreSQL database (by using the Debezium PostgreSQL source connector) and replicating its content into a MongoDB database (by using your very own MongoDB sink connector). I'm very happy with the end result.

Once again, thank you so much for sharing it with the community.