ogrisel / spylearn

Repo for experiments on pyspark and sklearn

Quick explanation? #1

Open twiecki opened 10 years ago

twiecki commented 10 years ago

This looks super interesting. I know the basics of PySpark, but apparently that's not enough to understand what the trick is here, as I can't find any reference to pyspark in the code. I assume that writing functional Python code using map + reduce etc. will allow mapping to Spark?

Any hints on the central idea here would be greatly appreciated!

ogrisel commented 10 years ago

Right now this is more like a raw brain dump to identify issues when trying to use PySpark to scale typical PyData operations on large collections.

The code uses PySpark via RDD instances. Have a look at the tests that create toy RDD instances using SparkContext.parallelize.
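For illustration, a minimal sketch of what such a toy RDD setup looks like (the names and data here are illustrative, not taken from the actual test suite):

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local", "spylearn-demo")

# Parallelize a Python list of small numpy arrays into an RDD with 4 partitions.
data = [np.random.randn(10, 3) for _ in range(20)]
rdd = sc.parallelize(data, numSlices=4)

print(rdd.count())        # 20 records
print(rdd.first().shape)  # (10, 3)
```

The distributed helpers in the repo then operate on RDDs built this way.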

MLnick commented 10 years ago

In essence this is using PySpark as a backend for distribution, as opposed to, say, IPython parallel.

The advantages vs. the alternatives may include the powerful programming model, fault tolerance, HDFS compatibility, and Spark's broadcast and accumulator variables (and hopefully PySpark Streaming at some point).
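For anyone unfamiliar with those two primitives, a minimal sketch of how they are used from PySpark (the values and names are just illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local", "broadcast-accumulator-demo")

# Broadcast: ship a read-only value (e.g. model coefficients) to every worker once.
coef = sc.broadcast([0.5, -1.2, 3.0])

# Accumulator: workers can only add to it; the driver reads the final value.
n_rows = sc.accumulator(0)

def score(row):
    n_rows.add(1)
    return sum(c * x for c, x in zip(coef.value, row))

rdd = sc.parallelize([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(rdd.map(score).collect())  # scores computed on the workers
print(n_rows.value)              # 2 rows seen in total
```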

(Disadvantages may include performance overhead due to Java/Python interoperability, but see the code for blocking RDDs of numpy arrays as an example of something that should improve performance substantially.)
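The blocking idea, roughly (a sketch of the pattern only, not the actual spylearn code; `block_rdd` is a made-up name):

```python
import numpy as np

def block_rdd(rdd):
    """Group the per-record numpy rows of an RDD into one 2D array per
    partition, so downstream work is vectorized numpy instead of a
    Python function call per record."""
    def to_block(iterator):
        rows = list(iterator)
        if rows:
            yield np.vstack(rows)
    return rdd.mapPartitions(to_block)

# usage: block_rdd(sc.parallelize([np.random.randn(3) for _ in range(1000)], 8))
# yields an RDD of 8 arrays of shape (125, 3)
```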

In my view the core focus should probably be on distributed versions of:


freeman-lab commented 10 years ago

Great thoughts.

Re: Thomas's specific question: the dependence on PySpark might seem opaque, but the key idea is that most of the operations (e.g. all the functions in blocked_rdd_math) are performed on PySpark RDDs (Spark's primary abstraction), which are created from a SparkContext when the data are loaded (see the unit tests for examples of RDD creation). So the maps and reduces are transformations and actions, respectively, performed on an RDD.
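Concretely, assuming an existing SparkContext `sc` as in the earlier sketch, the distinction looks like this (not code from the repo):

```python
import numpy as np

rdd = sc.parallelize([np.arange(5, dtype=float) for _ in range(100)])

# map() is a transformation: lazy, it only records the computation on the RDD.
squared = rdd.map(lambda x: x ** 2)

# reduce() is an action: it actually runs the job and returns a value to the driver.
total = squared.reduce(lambda a, b: a + b)
print(total)  # elementwise sum of the 100 squared arrays
```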

Re: the roadmap, it's also worth adding that this project overlaps a bit with the AMPLab's MLlib, but I see it as both complementary and potentially much broader and faster to develop for, because a native PySpark implementation means we can draw on many existing libraries from sklearn/numpy/scipy within RDD operations (as in the linear_model example).
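As an illustration of drawing on sklearn inside RDD operations, here is a sketch of a simple parameter-averaging approach (not necessarily what the repo's linear_model example does; the function names are made up):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def fit_block(block):
    # block is an (X, y) pair of numpy arrays for one blocked partition;
    # assumes every block contains samples from each class.
    X, y = block
    clf = SGDClassifier()
    clf.fit(X, y)
    return clf.coef_, clf.intercept_

def averaged_sgd(blocked_xy_rdd):
    """Fit one sklearn SGDClassifier per block on the workers and
    average the parameters on the driver."""
    fits = blocked_xy_rdd.map(fit_block).collect()
    coef = np.mean([c for c, _ in fits], axis=0)
    intercept = np.mean([i for _, i in fits], axis=0)
    return coef, intercept
```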

sryza commented 10 years ago

I also think that, where appropriate, it would be useful to supply sklearn-style front ends to existing MLlib algorithms. Not all algorithms transplant from sklearn as easily as SGD does, and until we have distributed Python implementations, providing access to these algorithms on Spark through a familiar interface could benefit PyData users.
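A sketch of what such a front end could look like, e.g. around MLlib's k-means (the class is hypothetical, not an existing spylearn wrapper):

```python
import numpy as np
from pyspark.mllib.clustering import KMeans as MLlibKMeans

class SparkKMeans(object):
    """Hypothetical sklearn-flavoured front end for MLlib's k-means."""

    def __init__(self, n_clusters=8, max_iter=100):
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def fit(self, rdd):
        # rdd: an RDD of feature vectors (lists or 1D numpy arrays)
        self.model_ = MLlibKMeans.train(rdd, self.n_clusters,
                                        maxIterations=self.max_iter)
        self.cluster_centers_ = np.asarray(self.model_.clusterCenters)
        return self

    def predict(self, rdd):
        # Assign each point to its nearest learned centre using plain numpy.
        centers = rdd.context.broadcast(self.cluster_centers_)
        return rdd.map(lambda x: int(
            np.argmin(((centers.value - np.asarray(x)) ** 2).sum(axis=1))))
```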

MLnick commented 10 years ago

@sryza yes, absolutely; cf. the k-means example. It would be nice to do a similar wrapper for the recommendation code.
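A similarly hypothetical sketch for the recommendation side, wrapping MLlib's ALS (the class name and constructor parameters are made up for illustration):

```python
from pyspark.mllib.recommendation import ALS, Rating

class SparkALSRecommender(object):
    """Hypothetical sklearn-style wrapper around MLlib's ALS."""

    def __init__(self, rank=10, n_iter=10, reg=0.01):
        self.rank = rank
        self.n_iter = n_iter
        self.reg = reg

    def fit(self, ratings_rdd):
        # ratings_rdd: an RDD of (user, item, rating) triples
        ratings = ratings_rdd.map(
            lambda r: Rating(int(r[0]), int(r[1]), float(r[2])))
        self.model_ = ALS.train(ratings, self.rank, self.n_iter, self.reg)
        return self

    def predict(self, user_item_rdd):
        # user_item_rdd: an RDD of (user, item) pairs
        return self.model_.predictAll(user_item_rdd)
```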