rheem-ecosystem / rheem

Rheem - a cross-platform data processing system
https://rheem-ecosystem.github.io

Implement SqlToRddOperator #53

Open · luckyasser opened 7 years ago

luckyasser commented 7 years ago

Currently, a plan with this Rheem context:

```java
RheemContext rheemContext = new RheemContext(conf)
        .with(Postgres.plugin())
        .with(Spark.basicPlugin());
```

would fail because the PlanEnumerator cannot find a way to convert a SqlQueryChannel to any of Spark's supported channels. So we need a proper conversion operator.

Also, while we're on the subject: why don't we implement direct RddToStream and StreamToRdd conversion operators, rather than what we currently do to convert between the two platforms (Spark and Java), which is either serializing to disk (HDFS sinks -> FileChannels -> FileSources) or collecting (collection channels -> collection sources)?

I'm aware that cost-wise we won't save anything, since the data has to be collected/serialized anyway. But from the semantic point of view of an execution plan, why do we need the above breakdown instead of a simple PlatformAChannel -> conversionOperator -> PlatformBChannel scheme?

sekruse commented 7 years ago

👍 for a conversion operator from SqlQueryChannels to RddChannels. My question, however, is how such an operator could work. Maybe with some Hadoop JDBC input format?
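One possible answer to "how could such an operator work": Spark itself ships a JdbcRDD that pulls a partitioned SQL query result into an RDD. Below is a minimal sketch of that route, not Rheem code; the connection URL, credentials, query, and key bounds are placeholders for illustration.

```java
import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.JdbcRDD;

public class SqlToRddSketch {

    // Expose a Postgres query result as a JavaRDD via Spark's built-in JdbcRDD.
    public static JavaRDD<Object[]> sqlToRdd(JavaSparkContext sc) {
        return JdbcRDD.create(
                sc,
                // ConnectionFactory: each Spark task opens its own JDBC connection.
                () -> DriverManager.getConnection(
                        "jdbc:postgresql://localhost/db", "user", "secret"),
                // JdbcRDD requires two '?' placeholders on a numeric key,
                // which it uses to split the query into partitions.
                "SELECT * FROM lineitem WHERE ? <= l_orderkey AND l_orderkey <= ?",
                1L, 1_000_000L,   // lower/upper bound of the partitioning key
                4,                // number of partitions
                // Map each JDBC row to a plain Object[] record.
                (ResultSet row) -> new Object[]{ row.getObject(1), row.getObject(2) }
        );
    }
}
```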

That being said, you can add .with(Java.channelConversionPlugin()), which provides ChannelConversions between collections, streams, and files.
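Putting that together with the snippet from the issue, the context would then read as follows (a sketch; it assumes the plugin factory methods are exactly as named above):

```java
RheemContext rheemContext = new RheemContext(conf)
        .with(Postgres.plugin())
        .with(Java.channelConversionPlugin()) // adds collection/stream/file conversions
        .with(Spark.basicPlugin());
```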

About the rationale for breaking down channel conversions into multiple steps: when you have n different channel types, you would in principle need n(n-1) different conversion operators. It is tedious to provide and maintain all of them. It gets even worse when different Rheem extensions add new channel types that don't know about each other: who should provide the conversion operator then? It is a distinctive feature of Rheem to plan communication between platforms given only some atomic conversion operators. Let me therefore claim that our channel conversion graph approach is easy to maintain and flexible enough to handle extensions. 😉
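To make the arithmetic concrete: with just 5 channel types, direct conversion already means 5 × 4 = 20 hand-written operators, whereas a conversion graph only needs enough atomic edges to keep the graph connected; the planner finds composite paths itself. Here is a toy, self-contained sketch of that path search. The channel names are taken from this thread, but the edge set is made up for illustration and is not Rheem's actual conversion graph.

```java
import java.util.*;

public class ConversionGraphDemo {

    // Atomic conversions: channel type -> channel types it can be converted into.
    static final Map<String, List<String>> CONVERSIONS = Map.of(
            "SqlQueryChannel",   List.of("StreamChannel"),
            "StreamChannel",     List.of("CollectionChannel", "FileChannel"),
            "CollectionChannel", List.of("StreamChannel", "RddChannel"),
            "FileChannel",       List.of("RddChannel", "StreamChannel"),
            "RddChannel",        List.of("FileChannel", "CollectionChannel")
    );

    // Breadth-first search for a shortest chain of atomic conversions.
    static List<String> findConversionPath(String from, String to) {
        Map<String, String> parent = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>(List.of(from));
        parent.put(from, null);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (current.equals(to)) {
                LinkedList<String> path = new LinkedList<>();
                for (String c = to; c != null; c = parent.get(c)) path.addFirst(c);
                return path;
            }
            for (String next : CONVERSIONS.getOrDefault(current, List.of())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return List.of(); // no conversion possible
    }

    public static void main(String[] args) {
        // Prints e.g. [SqlQueryChannel, StreamChannel, CollectionChannel, RddChannel]
        System.out.println(findConversionPath("SqlQueryChannel", "RddChannel"));
    }
}
```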

luckyasser commented 7 years ago

For the SqlQueryChannel-to-RddChannel conversion operator, we can take a look at SparkSQL for inspiration (I know there's a way to convert a SchemaRDD to a normal RDD, and a DataFrame to an RDD).
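For reference, with the DataFrame API (the successor of SchemaRDD), that route is short: read the table through the JDBC data source and drop down to a plain RDD. A minimal sketch; the URL, table name, and credentials are placeholders.

```java
import java.util.Properties;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameToRdd {

    public static JavaRDD<Row> load(SparkSession spark) {
        Properties props = new Properties();
        props.setProperty("user", "user");
        props.setProperty("password", "secret");

        // Read the table as a DataFrame via Spark's JDBC data source ...
        Dataset<Row> df = spark.read()
                .jdbc("jdbc:postgresql://localhost/db", "lineitem", props);

        // ... then convert DataFrame -> RDD<Row>, the conversion mentioned above.
        return df.javaRDD();
    }
}
```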

+1 for the scalability and code maintenance argument above. This, however, might come at the cost of performance. I think someone has to look into the best (fastest, most natural) way to connect channel A to channel B anyway, especially if they belong to different platforms. The SqlToStream operator itself is an example that blindly using atomic conversion operators is not the way to go (we could have just read everything from the DB driver, i.e. collect, write into a file or a Java collection, and proceed as normal). Instead, the "natural" way of plugging the two platforms together is to expose a JDBC ResultSet via iterators to the Stream, to be consumed directly. I was also wondering whether there is a way to connect Java parallel streams directly to an RDD channel and consume them in a similar manner (probably not).
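For concreteness, the ResultSet-to-Stream plumbing described above can be done with plain JDK classes: wrap the cursor in an Iterator so rows are pulled lazily into the Stream instead of being materialized first. A minimal sketch (the buffering flags guard against JDBC's forward-only cursor being advanced twice):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class ResultSetStreams {

    public static Stream<ResultSet> stream(Connection connection, String query)
            throws SQLException {
        Statement statement = connection.createStatement();
        ResultSet resultSet = statement.executeQuery(query);

        // Lazily advance the cursor; each next() yields the same ResultSet
        // positioned on the next row.
        Iterator<ResultSet> rows = new Iterator<>() {
            private boolean advanced = false, hasRow = false;

            @Override public boolean hasNext() {
                if (!advanced) {
                    try { hasRow = resultSet.next(); advanced = true; }
                    catch (SQLException e) { throw new IllegalStateException(e); }
                }
                return hasRow;
            }

            @Override public ResultSet next() {
                if (!hasNext()) throw new NoSuchElementException();
                advanced = false;
                return resultSet;
            }
        };

        return StreamSupport.stream(
                        Spliterators.spliteratorUnknownSize(rows, 0), false)
                .onClose(() -> {
                    try { statement.close(); } catch (SQLException ignored) { }
                });
    }
}
```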

sekruse commented 7 years ago

SchemaRDD sounds promising; it would be great to have that.

Regarding your other point, I see no contradiction with what Rheem does. There is no denying that writing dedicated channel conversions can improve performance, and indeed you can just register new conversions with Rheem. The optimizer will then consider them during optimization and likely use them.

luckyasser commented 7 years ago

Makes sense.