kamu-data / kamu-cli

Next-generation decentralized data lakehouse and a multi-party stream processing network
https://kamu.dev

[FeatureReq] : New engine - Apache Pulsar #31

Closed · verbunk closed this issue 3 years ago

verbunk commented 3 years ago

Hey Folks,

The project looks awesome! I'd like to propose an app integration / new engine with Apache Pulsar. It's a streaming pub/sub platform with native support for running local code to mutate/transform data. Each topic also supports Avro schemas registered with an understanding of data-model revisions.

-J

sergiimk commented 3 years ago

Hi. Thanks for your interest in the project.

Which aspect of Pulsar do you think would be useful in kamu?

I'm not familiar with that project, but from a quick look it seems to be primarily a message broker with some limited support for message processing via Pulsar Functions.

So it's closer to the first-generation stateless stream processing systems that don't provide features like aggregations, joins, or event-time processing semantics that Spark and Flink give us.

If we used Pulsar as an engine, I think it would limit us to map- and filter-style operations only, which is not very useful.
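To make the distinction concrete, here is a rough sketch of the kind of stateful, event-time windowed aggregation that a map/filter-only system cannot express. It's written in PySpark purely for illustration; kamu transforms are normally declared as SQL queries executed by the Spark or Flink engine, and the source and column names below are made up:

```python
# Illustrative only: an event-time windowed aggregation with watermarking,
# the kind of stateful operation Spark/Flink support but a stateless
# per-message pipeline cannot express.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("event-time-demo").getOrCreate()

# Built-in "rate" test source stands in for a real event stream;
# its 'timestamp' column plays the role of the event time.
events = (
    spark.readStream
    .format("rate")
    .load()
    .withColumnRenamed("timestamp", "event_time")
)

# Tolerate data up to 10 minutes late and count events per 1-hour
# event-time window -- this requires keeping state across events.
hourly_counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "1 hour"))
    .count()
)

query = (
    hourly_counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()
```

The watermark and window require the engine to hold state across events and reason about event time, which is exactly what stateless per-message processing lacks.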

verbunk commented 3 years ago

Pulsar has connectors for MQ, Kafka, and Flink, so work can be sent to the best external processing engine for complex tasks. It also integrates Debezium, so any supported DB can be a streaming source. It's, as you said, a light transform-processing engine, but it has a lot of connectors to farm out work.

sergiimk commented 3 years ago

Engines in the Open Data Fabric protocol run in a very restricted environment. They are limited to working only with an input data chunk and a checkpoint, and cannot reach out to external systems. This is how we ensure 100% reproducibility and achieve verifiable trust in a fully distributed system.

If this sounds confusing, think of how git validates the consistency of the entire commit history when you clone a repo from a remote host. In kamu's case, though, this history does not contain the actual data, only a history of operations; by repeating them we can fully reconstruct the data, thanks to the reproducibility/determinism guarantee.
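As a minimal sketch of that contract (the names and types below are illustrative, not the actual ODF engine API): a transform is a pure function of the input slice and the previous checkpoint, so replaying the recorded chain of operations deterministically rebuilds the data:

```python
# Illustrative sketch of an ODF-style engine contract. The transform sees
# ONLY the input chunk and the prior checkpoint -- no network, no clock --
# so replaying the recorded history reconstructs the dataset exactly.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataSlice:
    records: List[dict]


@dataclass
class Checkpoint:
    state: Dict[str, int] = field(default_factory=dict)


def execute_transform(
    input_slice: DataSlice, prev_checkpoint: Checkpoint
) -> Tuple[DataSlice, Checkpoint]:
    """Toy deterministic transform: a running count per key.
    The same inputs always produce the same outputs."""
    counts = dict(prev_checkpoint.state)
    out = []
    for rec in input_slice.records:
        key = rec["key"]
        counts[key] = counts.get(key, 0) + 1
        out.append({"key": key, "count": counts[key]})
    return DataSlice(out), Checkpoint(counts)


def reconstruct(history: List[DataSlice]) -> List[dict]:
    """Replay the recorded chain of operations from scratch,
    much like re-validating a git history, to rebuild the data."""
    checkpoint = Checkpoint()
    output: List[dict] = []
    for slice_ in history:
        result, checkpoint = execute_transform(slice_, checkpoint)
        output.extend(result.records)
    return output


if __name__ == "__main__":
    history = [
        DataSlice([{"key": "a"}, {"key": "b"}]),
        DataSlice([{"key": "a"}]),
    ]
    print(reconstruct(history))
    # [{'key': 'a', 'count': 1}, {'key': 'b', 'count': 1}, {'key': 'a', 'count': 2}]
```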

So in the context of derivative datasets there is no value in having the variety of connectors an engine like Pulsar offers, since the sandbox environment prevents it from accessing anything external.

There might be some use for it in root datasets, though, when ingesting data, as we are planning to add support for a push API. While currently kamu only ingests data from files (local or downloaded from the web), a push API would let you feed new data into kamu from Kafka or some other message broker. But in that case Pulsar would not be an Engine (in ODF terms), but rather an alternative data ingestion mechanism.
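Purely as a hypothetical illustration of that distinction (the push API does not exist yet, and the `push_to_root_dataset` helper below is invented), a push ingest would look like a small bridge that forwards broker messages into a root dataset, while derivative datasets keep being computed by the sandboxed engines downstream:

```python
# Hypothetical sketch of a push-ingest bridge. KafkaConsumer is from the
# kafka-python package; the push call itself is a made-up placeholder.
import json

from kafka import KafkaConsumer


def push_to_root_dataset(dataset_name: str, record: dict) -> None:
    """Stand-in for the future push API -- here we only print the record."""
    print(f"append to {dataset_name}: {record}")


consumer = KafkaConsumer(
    "sensor-readings",                      # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # No transform happens here -- events are simply appended to the
    # history of a ROOT dataset; derivative datasets are still computed
    # by the sandboxed engines.
    push_to_root_dataset("my.sensor.readings", message.value)
```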

Does this make sense?