ondra-m / ruby-spark

Ruby wrapper for Apache Spark
MIT License
227 stars, 29 forks

Support for dataframes #6

Open gnilrets opened 9 years ago

gnilrets commented 9 years ago

I'm really interested in using Spark and would love to be able to interact with it using Ruby. This gem looks like a great option. It doesn't look like it natively supports Spark DataFrames, right? Would there be any way to interact with DataFrames using this gem? If not, how much effort would you expect it to take to build that in?

deric commented 9 years ago

The project is still in its alpha stage; basically, it is only a proof of concept. We've tried using the Spark API from Ruby and it works! :) Currently we support only a subset of the Spark API. We'd like to attract more developers and extend the supported functions. Right now ruby-spark runs better on MRI than on JRuby (this might change with the JRuby 9.0.0.0 release).

ruby-spark interacts with the JVM (Scala backend), so almost anything that is possible in Python should also be possible in Ruby. Have a look at some benchmarks here: http://ondra-m.github.io/ruby-spark/

ondra-m commented 9 years ago

DataFrame is part of Spark SQL, which is on the TODO list.

gnilrets commented 9 years ago

This is the kind of project I could be interested in contributing to. If you had to put a rough estimate on the number of developer hours you think it would take to build in Spark SQL support, what would it be?

ondra-m commented 9 years ago

The SQL implementation will take a long time. Currently there are more important things to do (some RDD and MLlib methods are missing, better Proc serialization, ...).

xjlin0 commented 9 years ago

This is a really great gem, and DataFrames will be the foundation for attracting great projects to it. Could I help with your development by working on the documentation?