h2oai / sparkling-water

Sparkling Water provides H2O functionality inside Spark cluster
https://docs.h2o.ai/sparkling-water/3.3/latest-stable/doc/index.html
Apache License 2.0

Can Same architecture of Sparkling water be followed to Integrate H2O with FLINK #7

Closed raghavchalapathy closed 7 years ago

raghavchalapathy commented 9 years ago

Hi Michal

I was working on integrating H2O with Flink, but I see that the Flink roadmap follows a Mahout DSL between Flink and Mahout, along the same lines as the integration with Spark, rather than the way it is done with H2O.

See the thread linked here: http://mail-archives.apache.org/mod_mbox/flink-dev/201501.mbox/%3CCANC1h_s=DtNjS+KQcU-Uxdb=i+_o4KPV-EOKQacr-KpPFX_OKw@mail.gmail.com%3E

Kindly advise: are there any limitations to, or any need for, integrating Flink with H2O?

I believe Flink would benefit from the fact that H2O provides deep learning out of the box, using R data frames. Please correct me if I am wrong.

Raghav

mmalohlava commented 9 years ago

Hi Raghav,

thank you for the comment!

First, let me clarify the motivation for Sparkling Water: from our point of view, we want to enable Spark and H2O users to use both platforms together easily, which benefits both platforms. This matters mainly if you have an existing Spark workflow and would like to use the advanced ML toolbox or other services provided by H2O.

I can definitely imagine a much tighter integration using Spark execution primitives, but the current integration (which can be considered really simple) allows us to provide all H2O services (including the UI and the R/Python connectors) on top of Spark, evaluate user demand, and, above all, create non-trivial applications on top of both platforms.

I am not a Flink expert. However, based on a discussion with Kostas and Stephan from the Flink team, we figured out that integrating H2O with Flink (in the same way as it is done in Sparkling Water) should be straightforward. If I remember correctly, they mentioned a few technical obstacles, which were considered more like cosmetic details.

I can still see benefits for Flink and H2O from such an integration, the same benefits we stress for Sparkling Water, even though the integration would not be technically perfect.

Please let me know if you would like to discuss the integration in more detail or do a code review! Thank you! michal

(In reply to raghavchalapathy's comment of 4/16/15, 10:11 PM.)

raghavchalapathy commented 9 years ago

Thank you so much for the elaborate insight! I shall get in touch with you as appropriate.

With regards, Raghav

alexeyegorov commented 8 years ago

Just out of curiosity: what is the current status of "Flinking Water", given that the last response on this topic was about a year ago?

mmalohlava commented 8 years ago

@alexeyegorov no updates on our side. No news from @raghavchalapathy so far.

Do you have a specific idea or use case for Flinking Water? We can help with the design and guide development if you would like to participate.

btw: I love the name Flinking Water :+1: !

alexeyegorov commented 8 years ago

@mmalohlava I am writing a master's thesis in which I want to compare the performance of Storm, Spark and Flink using a further abstraction layer (the streams framework developed at my faculty in Dortmund). As we cooperate with physicists working on gamma-ray astronomy, we have a use case with high-volume image data from a telescope. At the moment, an offline tool such as RapidMiner (also developed in Dortmund) or WEKA is used to train a Random Forest model to distinguish gammas from hadrons.

As part of my thesis, I was thinking of running a pipeline in a Lambda or Kappa architecture style, using some framework to build a model and then applying it to the incoming stream, all with distributed computing. I found that H2O has a very wide range of ML algorithms and Spark's MLlib is also pretty fancy, while Flink ML is still rather weak. Depending on how good or bad Flink turns out to be (some people describe it as much faster than micro-batched Spark Streaming), it would be interesting to combine H2O with Flink.

I am not sure I am able to start "Flinking Water" on my own, but I thought it would be a win-win situation for Flink. If other people were working on it too, it would become worthwhile for me to invest more time in it. I don't think I have time for this whole project on my own.

In case I need some support, I feel that I can get it from you and your team! ;)

P.S. After "Sparkling Water" it is rather straightforward to come up with "Flinking Water"... especially in German it sounds like a fancy keyword! I definitely share your excitement about both the word combination and the project itself.

P.P.S. By the way, our motivation is the ongoing development of a large telescope array that will generate much more data than a single telescope. ;)

mmalohlava commented 8 years ago

Cool! Sounds great!

Are you planning to use online learning? Right now, H2O supports only offline learners.

However, you can build a model offline on a batch of data (you can use H2O directly), export the model as code (a model POJO), and compile it as a Storm bolt, which you can plug directly into Flink. The model would then score incoming events.
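The offline-train, online-score pattern described here can be sketched roughly as follows. This is only an illustration in plain Python, not H2O's POJO API or a Storm bolt: `ThresholdModel`, `score_stream`, and the `width` feature are hypothetical names standing in for an exported model and the operator that applies it to each event.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple

Event = Dict[str, float]


@dataclass(frozen=True)
class ThresholdModel:
    """Hypothetical stand-in for a model trained offline and exported as code."""

    feature: str
    threshold: float

    def predict(self, event: Event) -> str:
        # Purely illustrative rule; a real exported model would encode
        # the trained trees or weights instead of a single threshold.
        return "gamma" if event[self.feature] >= self.threshold else "hadron"


def score_stream(
    model: ThresholdModel, events: Iterable[Event]
) -> Iterator[Tuple[Event, str]]:
    """Apply the frozen model to each incoming event, one at a time."""
    for event in events:
        yield event, model.predict(event)


if __name__ == "__main__":
    model = ThresholdModel(feature="width", threshold=0.5)
    incoming = [{"width": 0.9}, {"width": 0.1}]
    for event, label in score_stream(model, incoming):
        print(event, "->", label)
```

The model is immutable once built, so the scoring step needs no training logic; in a streaming engine, `score_stream` would correspond to a per-event map operator.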

Would it work for you?

Wow, what is the schedule for the large telescope array?

chobeat commented 8 years ago

Good to see the topic is still alive. We have a huge interest in replicating on Flink the work done with Spark, because Flink is the processing engine we use here at Radicalbit. We haven't started yet because the effort involved is not clear. So any news or any sign of interest from other companies or students toward this goal is welcome.

@alexeyegorov I agree that you don't really need Flinking Water for your result. Learning inside Flink is not necessary for what you need. H2O is great for the portability of its models, and it could be a good option. I have never tried that on Flink, though. I'm developing a library called Flink-JPMML that could help you once it is mature, but right now it is not something I consider good enough to share with others, and it's off-topic here anyway.

alexeyegorov commented 8 years ago

@mmalohlava The schedule is very complex, as it contains different pipelines. As far as I know, the first telescopes should go online around 2018.

@chobeat @mmalohlava H2O caught my attention while I was comparing different distributed machine learning frameworks. My main goal is just to compare Flink, Spark and Storm on our data. Machine learning is only a second step and does not have to be performed inside H2O. But as it supports a good number of algorithms, I thought it could be interesting for later experiments. In particular, it could be interesting to compare the execution time of the same algorithms inside Flink and Spark.

@chobeat Flink-JPMML does not seem to be needed in our case, as we use another abstraction layer (the streams framework) on top of Spark, Flink or Storm. This way we can write simple stream processing tasks in Java and then just run them in Spark, Flink or elsewhere. We have already implemented support for reading and applying models in PMML and simple JSON formats. Nevertheless, if you intend to share Flink-JPMML, it would be interesting to look at it. ;)
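As a rough illustration of the idea of engine-independent processing tasks that apply a model read from a simple JSON format, here is a minimal sketch in plain Python. It does not reproduce the streams framework's actual API: the `Processor` type, `load_linear_model`, `run_locally`, and the JSON layout (`weights`, `bias`) are all assumptions made for this example.

```python
import json
from typing import Callable, Dict, Iterable, List

Item = Dict[str, float]
Processor = Callable[[Item], Item]


def load_linear_model(spec: str) -> Processor:
    """Build a processor from a hypothetical JSON spec: {"weights": {...}, "bias": b}."""
    parsed = json.loads(spec)
    weights, bias = parsed["weights"], parsed["bias"]

    def apply(item: Item) -> Item:
        # Weighted sum of the named features plus the bias term.
        score = bias + sum(w * item[name] for name, w in weights.items())
        # Return a new item so the input is left unmodified.
        return {**item, "score": score}

    return apply


def run_locally(processors: List[Processor], source: Iterable[Item]) -> List[Item]:
    """Trivial local runner that chains the processors over every item.

    An engine-specific adapter would instead map the same processors
    onto that engine's own operators.
    """
    results = []
    for item in source:
        for process in processors:
            item = process(item)
        results.append(item)
    return results


if __name__ == "__main__":
    spec = '{"weights": {"x": 2.0, "y": -1.0}, "bias": 0.5}'
    print(run_locally([load_linear_model(spec)], [{"x": 1.0, "y": 3.0}]))
```

Because each processor is just a function from item to item, the task definition stays independent of whichever engine eventually runs it.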

jakubhava commented 7 years ago

Closing this for now, as it's not actually an issue. Thanks for the discussion!