Open usmanm opened 7 years ago
This all sounds good to me. Regarding replay, I think an ephemeral bgworker (or maybe a group of them) will work just fine for that.
The other important thing I think we need to address is packaging. The ZK client library is a huge pain to install on non-Ubuntu systems. There just happens to be an apt repo for `zookeeper_mt`. Building and installing `librdkafka` is easy and clean, but `zookeeper_mt` is not, so I don't feel great about deferring that complexity to users.
Some options here:

- Ship `pipeline_kafka` with core packages like we used to (ideal option)
- Provide `pipeline_kafka` packages that can be downloaded and installed (adds no value for users versus the first option and gives us more things to manage)

The main thing is that adding the ZK dependency means it is no longer reasonable for us to expect users to build and install `pipeline_kafka` themselves.
We might have to bump this up, since the way our parallelism currently works will cause a latency of `parallelism * timeout`. This is because each partition is consumed independently and we wait for `timeout` when polling each partition. Super low timeouts burn a lot of CPU, so that's not a fix.
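To make the latency math above concrete, here is a back-of-the-envelope sketch (the partition count and timeout values are made up for illustration):

```python
# Worst case for sequential per-partition polling: a message that arrives
# on a partition just after its poll returned empty is not seen again
# until every other partition's blocking poll has completed, i.e. one
# full round of `parallelism` polls of `timeout` each.
def worst_case_latency_ms(parallelism: int, poll_timeout_ms: int) -> int:
    return parallelism * poll_timeout_ms

# e.g. 8 partitions handled by one worker with a 250 ms poll timeout
print(worst_case_latency_ms(8, 250))  # 2000 ms worst case
```

Shrinking `poll_timeout_ms` reduces this worst case linearly, but as noted above it trades latency for CPU spent spinning on empty polls.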
For replay, we can probably start off by just running the replay code in the client process.
More suggestions:

- Kafka 0.10.1 now has the ability to map timestamps to offsets (see `offsetsForTimes` at https://kafka.apache.org/0101/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html). It maps nicely to a WHERE query with a `kafka_timestamp`, and it would be cool if PipelineDB incorporated that (you'll notice that the API returns a map of offsets per partition).
- I like the fact you're storing offsets in PipelineDB instead of Kafka/ZK, because it will allow you to achieve exactly-once semantics. If you plan on using ZK or Kafka for storing offsets, then you're going to be either at-least-once or at-most-once, which may be an issue for a database.
- Avro support (integration with the Kafka schema registry).
- Broker discovery shouldn't be done using ZooKeeper (that's the old way), but using Kafka (via bootstrap servers).
Thanks for the suggestions @simplesteph! Will incorporate them when writing the new client.
Hi, what about Avro support here?
We're going to keep the `0.8.2.2` branch alive for Kafka 0.8. The `master` branch is only going to be compatible with Kafka 0.9+. The new `pipeline_kafka` will have a completely new API. Some notes about the implementation and features:

- `N` concurrent workers can be spawned, where each of them will consume data for all topics configured
- `(topic, relation)` consume actions: the worker processes will read from all the unique topics and broadcast each topic to all the relations they are mapped to

Unanswered questions:

- How do we replay a `(topic, relation)`? Should we just have ad hoc worker processes that do that using the simple API? Basically take arguments that look like `(topic, stream, start_offset, end_offset)`.

@derekjn: Thoughts?
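For discussion, here is a sketch of the shape such an ad hoc replay worker could take, driven by the `(topic, stream, start_offset, end_offset)` arguments. Everything here is hypothetical: `fetch` stands in for a simple-API consumer read at a fixed offset, and `emit` for an insert into the target stream.

```python
def replay(fetch, emit, topic, stream, start_offset, end_offset):
    """Replay messages in [start_offset, end_offset) from `topic` into
    `stream`. `fetch(topic, offset)` returns the message at that offset
    (or None past the log end); `emit(stream, msg)` inserts it.
    These names are illustrative, not pipeline_kafka's API."""
    offset = start_offset
    while offset < end_offset:
        msg = fetch(topic, offset)
        if msg is None:
            break  # reached the log end before end_offset
        emit(stream, msg)
        offset += 1
    return offset  # next offset after the replayed range

# toy in-memory "log" and "stream" to exercise the sketch
log = {"clicks": ["m0", "m1", "m2", "m3"]}
out = []
fetch = lambda topic, off: log[topic][off] if off < len(log[topic]) else None
emit = lambda stream, msg: out.append((stream, msg))
print(replay(fetch, emit, "clicks", "clicks_stream", 1, 3))  # 3
print(out)  # [('clicks_stream', 'm1'), ('clicks_stream', 'm2')]
```

Since the range is explicit and the worker is stateless, such processes could be spawned per replay request and exit when the range is exhausted, without touching the long-lived consumers' stored offsets.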