jaredpetersen / kafka-connect-arangodb

🥑 Kafka Connect sink connector for ArangoDB
https://www.confluent.io/connector/kafka-connect-arangodb/
MIT License

Add Source Connectivity #5

Open jaredpetersen opened 5 years ago

jaredpetersen commented 5 years ago

Add support for using ArangoDB as a source system. Read from the ArangoDB write-ahead log (WAL) and create Kafka messages based on the record changes.
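For a single-server instance, the polling loop could look roughly like the sketch below. This assumes ArangoDB's `/_api/wal/tail` replication endpoint; the class name, offset handling, and error handling are illustrative assumptions, not the connector's actual design.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Consumer;

// Minimal sketch: tail a single-server ArangoDB WAL and hand each change
// entry to a callback. A real SourceTask would turn each entry into a
// SourceRecord and persist the tick as its source offset.
public class WalTailer {
  private final HttpClient client = HttpClient.newHttpClient();
  private final String baseUrl; // e.g. "http://localhost:8529"
  private final String jwt;     // bearer token for authentication

  public WalTailer(String baseUrl, String jwt) {
    this.baseUrl = baseUrl;
    this.jwt = jwt;
  }

  // Fetch all operations after the given tick and return the last tick the
  // server included, so the caller can resume from there on the next poll.
  public String tail(String fromTick, Consumer<String> handler) throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(baseUrl + "/_api/wal/tail?from=" + fromTick))
        .header("Authorization", "Bearer " + jwt)
        .GET()
        .build();
    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());

    // The body is newline-delimited JSON; each line is one WAL entry.
    for (String line : response.body().split("\n")) {
      if (!line.isBlank()) {
        handler.accept(line);
      }
    }

    // The server reports the last tick it included in a response header.
    return response.headers()
        .firstValue("x-arango-replication-lastincluded")
        .orElse(fromTick);
  }
}
```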

This has some complications. ArangoDB's WAL API is only supported for single-server instances.

When you use ArangoDB in cluster mode, you end up with multiple DB-Servers that each maintain their own write-ahead log. If you tail each log, you can end up with duplicates, depending on how replication was set up. Additionally, some work is needed to ensure that the records from the individual logs are written into Kafka in the order they were written into ArangoDB. While we're doing that work, we may be able to do some de-duplication as well based on timestamps. We'll have to see when we get there.
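To give the de-duplication idea some shape, here's a rough sketch. Keying on the document's `_key` and `_rev` is my assumption about what identifies a change; the bounded cache trades perfect de-duplication for flat memory use, which is fine under at-least-once semantics.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a bounded de-duplication cache keyed on (document key, _rev).
// Entries are evicted in access order once the cap is hit, so a very old
// duplicate could slip through -- acceptable for at-least-once delivery.
public class DedupFilter {
  private static final int MAX_ENTRIES = 100_000;

  private final Map<String, Boolean> seen =
      new LinkedHashMap<>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
          return size() > MAX_ENTRIES;
        }
      };

  // Returns true the first time a (key, rev) pair is observed.
  public synchronized boolean firstSighting(String documentKey, String rev) {
    return seen.put(documentKey + "/" + rev, Boolean.TRUE) == null;
  }
}
```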

In all likelihood, we won't be able to provide an exactly-once delivery guarantee; consumers will have to expect duplicate messages. This should be acceptable: exactly-once only holds as long as nothing goes wrong with the producer, so consumers should already be prepared to receive messages at least once.
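In practice that just means consumers need to be idempotent. A minimal sketch (the record shape and names here are placeholders, not anything this connector defines):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an idempotent consumer: remember the last revision applied per
// document key and skip redeliveries. Works because a replayed message
// carries the same (_key, _rev) pair as the original.
public class IdempotentApplier {
  private final Map<String, String> lastAppliedRev = new HashMap<>();

  public void apply(String documentKey, String rev, Runnable sideEffect) {
    if (rev.equals(lastAppliedRev.get(documentKey))) {
      return; // duplicate delivery; already applied
    }
    sideEffect.run();
    lastAppliedRev.put(documentKey, rev);
  }
}
```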

One important thing to note is that this feature is not intended to be used for datacenter-to-datacenter replication. ArangoDB has its own solution for that as part of its enterprise offering (which incidentally uses Kafka). Our goal here is not to make enterprise features free; it's to hook up ArangoDB to Kafka.

LeonardoBonacci commented 3 years ago

Hi guy(s)! An ArangoDB source connector would be welcomed with open arms!

jaredpetersen commented 3 years ago

@LeonardoBonacci It's definitely in the works, but I've run into some snags.

ArangoDB in cluster mode doesn't have a good way to output data events. They have a Write-Ahead Log (WAL) API, but it's not supported for clusters, so the only way to get the events is to use a root (🤢) JSON Web Token for authentication and follow the WAL API on each node in the cluster. Since clusters can scale up and down, you also need to do node discovery so that new nodes can be followed.

Additionally, the events do not have a global timestamp, so it's very hard to combine those individual node event streams. The _rev key on each document is a pseudo global timestamp -- specifically a Hybrid Logical Clock. If you decode it (again, not supported), you can get the ordering of events and theoretically combine the streams using a windowing algorithm. Each of the streams will have duplicate data because of the replication scheme, so the data would ideally be de-duped by the connector before being output to Kafka.
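To make the windowing idea concrete, here's a rough sketch. The `decodeRevToHlc` helper is hypothetical -- it stands in for the unsupported `_rev` decoding -- and the window size is arbitrary.

```java
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Sketch: buffer events from every node's stream in a priority queue
// ordered by decoded Hybrid Logical Clock value, and only release events
// that have fallen behind the newest observed HLC by a fixed window.
// Late arrivals from slower nodes can still be merged in correct order
// as long as they land inside the window.
public class HlcMerger {
  record Event(long hlc, String documentKey, String rev, String payload) {}

  private static final long WINDOW = 5_000; // HLC slack; arbitrary choice

  private final PriorityQueue<Event> buffer =
      new PriorityQueue<>((a, b) -> Long.compare(a.hlc, b.hlc));

  public void offer(String documentKey, String rev, String payload) {
    buffer.add(new Event(decodeRevToHlc(rev), documentKey, rev, payload));
  }

  // Emit everything older than the newest buffered event minus the window.
  public void drain(Consumer<Event> emit) {
    if (buffer.isEmpty()) return;
    long horizon = buffer.stream().mapToLong(Event::hlc).max().getAsLong() - WINDOW;
    while (!buffer.isEmpty() && buffer.peek().hlc <= horizon) {
      emit.accept(buffer.poll());
    }
  }

  // Hypothetical: ArangoDB's _rev encoding is internal and not a supported
  // public API, so this is left as a placeholder.
  private static long decodeRevToHlc(String rev) {
    throw new UnsupportedOperationException("decode _rev (unsupported API)");
  }
}
```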

It should be doable -- ArangoDB's paid offering uses a proprietary Kafka utility to perform cross-cluster syncing. That's not the goal of this project but I think we'd be retrieving the data from the database in a similar way.

I'm working on all this, but it's not trivial, especially since I have a day job unrelated to this 🙂

But it's good to hear that there is interest!

LeonardoBonacci commented 3 years ago

Thanks a lot @jaredpetersen for the detailed explanation!

This morning I was thinking about ArangoDB and considered offering (you) some help writing the source connector. Now, having read about the very technical issues you're facing, and not being an ArangoDB insider myself, I can only wish you the best of luck! :)