microsoft / Trill

Trill is a single-node query processor for temporal or streaming data.
MIT License

Support for Apache Beam #50

Open MedAnd opened 5 years ago

MedAnd commented 5 years ago

Consider adding support for Apache Beam's unified model for defining both batch and streaming data-parallel processing pipelines.

cybertyche commented 5 years ago

This has been under active consideration for quite some time now. Part of the issue is that Beam's model and Trill's are substantially different in several ways, enough that supporting Beam would almost require a redesign or reimplementation of Trill into something far more Dataflow-like, which would eliminate most of what Trill does really well. That said, there is always the possibility that we could find some innovative way to support both.

Another issue is that other Beam implementations we've seen perform remarkably worse than their native counterparts, enough that people tend to lose interest quickly.

I'd love to get a conversation going on this though. What I would like to know, if possible, is:

MedAnd commented 5 years ago

Some initial feedback...

  1. High, especially if Trill supported distributed (multi-node) clusters like Apache Flink, Google Dataflow, etc.
  2. Able to replace an in-house Service Fabric hosted processing engine with an advanced, high-performance .NET engine like Trill. Trill could be offered as an Azure platform (an alternative to Cloud Dataflow), packaged as a container, etc.? Many MS platforms are integrating Spark; however, a distributed (multi-node) Trill solution would be easier for MS technology shops to adopt, as we'd be able to leverage the .NET Core ecosystem, tooling, etc.!
  3. A unified, cross-platform model (cloud and on-prem) for defining both batch and streaming processing, which avoids vendor and API lock-in.

Think this article is applicable: Why Apache Beam? A Google Perspective

MedAnd commented 5 years ago

A further discussion stimulator... Batch as a Special Case of Streaming

cybertyche commented 5 years ago

This is definitely good conversation fodder, and thank you. I think the biggest question here is this: if we go forward with a Beam API layer, where in the architecture would it sit? My immediate thought is that it would be atop Trill and not inside it, but that is certainly debatable.

As for batch as a special case of streaming, you've got no argument from me there. :-)
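To make the "batch as a special case of streaming" idea concrete, here is a minimal sketch in plain Python. It uses neither Trill's nor Beam's API; the names `Event`, `tumbling_counts`, and `live_source` are hypothetical, and it assumes events arrive in non-decreasing timestamp order. The same operator runs unchanged over a bounded list and over a generator standing in for a live source.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp in seconds, key).
Event = Tuple[int, str]


def tumbling_counts(
    events: Iterable[Event], width: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, per-key counts) for each tumbling window.

    Assumes events arrive in non-decreasing timestamp order, so a window
    can be emitted as soon as an event from a later window shows up.
    """
    current_start = None
    counts: Counter = Counter()
    for ts, key in events:
        start = (ts // width) * width
        if current_start is not None and start != current_start:
            yield current_start, dict(counts)
            counts = Counter()
        current_start = start
        counts[key] += 1
    # End of input: flush the last open window (only reached for bounded input).
    if current_start is not None:
        yield current_start, dict(counts)


def live_source() -> Iterator[Event]:
    # Stand-in for a streaming source; a real one would not terminate.
    yield from [(0, "a"), (1, "b"), (2, "a"), (11, "a"), (12, "b")]


# Batch: a bounded, fully materialized collection.
batch = [(0, "a"), (1, "b"), (2, "a"), (11, "a"), (12, "b")]
batch_result = list(tumbling_counts(batch, 10))

# Streaming: the same operator consuming events one at a time.
stream_result = list(tumbling_counts(live_source(), 10))
```

The only difference between the two runs is whether the input ever ends; the operator itself does not know or care, which is the sense in which a batch is just a stream with a known end.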

MedAnd commented 5 years ago

I would use either implementation in a large stream processing application if available today ☺️ I think the functionality offered by Azure Stream Analytics (ASA) is compelling, however on the distributed stream processing side I believe Azure does not have a true equivalent to Google Dataflow? Hope this project can change that... more conversation fodder to come ☺️