
Sparkit-learn
=============

|Build Status| |PyPi| |Gitter| |Gitential|

PySpark + Scikit-learn = Sparkit-learn

GitHub: https://github.com/lensacom/sparkit-learn

About
=====

Sparkit-learn aims to provide scikit-learn functionality and API on top of PySpark. The main goal of the library is to create an API that stays as close to sklearn's as possible.

The driving principle was "Think locally, execute distributively." To accommodate this concept, the basic data block is always an array or a (sparse) matrix, and operations are executed at block level.
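
To see the idea in plain PySpark terms (a minimal sketch that does not use Sparkit-learn's own API): each partition is materialized as a NumPy array block, and an ordinary local operation then runs once per block.

.. code:: python

    import numpy as np

    rdd = sc.parallelize(range(20), 4)  # sc is SparkContext

    # "Think locally": materialize each partition as one NumPy array block
    blocks = rdd.mapPartitions(lambda part: [np.array(list(part))])

    # "Execute distributively": apply a plain local operation to every block
    block_sums = blocks.map(lambda block: block.sum())
    print(block_sums.collect())  # one partial sum per block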

Requirements
============

Run IPython from the ``notebooks`` directory:

.. code:: bash

    PYTHONPATH=${PYTHONPATH}:.. IPYTHON_OPTS="notebook" ${SPARK_HOME}/bin/pyspark --master local[4] --driver-memory 2G

Run the tests with:

.. code:: bash

    ./runtests.sh

Quick start
===========

Sparkit-learn introduces three important distributed data formats: ``ArrayRDD``, ``SparseRDD`` and ``DictRDD``.
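
For example, an ``ArrayRDD`` wraps a PySpark RDD and regroups its elements into NumPy array blocks (a minimal sketch; the ``bsize`` block-size argument and the exact blocking shown here are assumptions based on the library's usage):

.. code:: python

    from splearn.rdd import ArrayRDD

    data = sc.parallelize(range(20), 2)  # two partitions, 10 elements each
    X = ArrayRDD(data, bsize=5)  # regrouped into 4 blocks, 5 elements each

    X.collect()
    # [array([0, 1, 2, 3, 4]),
    #  array([5, 6, 7, 8, 9]),
    #  array([10, 11, 12, 13, 14]),
    #  array([15, 16, 17, 18, 19])]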

Basic workflow
--------------

Using the data structures described above, the basic workflow is almost identical to sklearn's.

Distributed vectorizing of texts
--------------------------------

SparkCountVectorizer
^^^^^^^^^^^^^^^^^^^^

.. code:: python

    from splearn.rdd import ArrayRDD
    from splearn.feature_extraction.text import SparkCountVectorizer
    from sklearn.feature_extraction.text import CountVectorizer

    X = [...]  # list of texts
    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext

    local = CountVectorizer()
    dist = SparkCountVectorizer()

    result_local = local.fit_transform(X)
    result_dist = dist.fit_transform(X_rdd)  # SparseRDD

SparkHashingVectorizer
^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    from splearn.rdd import ArrayRDD
    from splearn.feature_extraction.text import SparkHashingVectorizer
    from sklearn.feature_extraction.text import HashingVectorizer

    X = [...]  # list of texts
    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext

    local = HashingVectorizer()
    dist = SparkHashingVectorizer()

    result_local = local.fit_transform(X)
    result_dist = dist.fit_transform(X_rdd)  # SparseRDD

SparkTfidfTransformer
^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    from splearn.rdd import ArrayRDD
    from splearn.feature_extraction.text import SparkHashingVectorizer
    from splearn.feature_extraction.text import SparkTfidfTransformer
    from splearn.pipeline import SparkPipeline

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.pipeline import Pipeline

    X = [...]  # list of texts
    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext

    local_pipeline = Pipeline((
        ('vect', HashingVectorizer()),
        ('tfidf', TfidfTransformer())
    ))
    dist_pipeline = SparkPipeline((
        ('vect', SparkHashingVectorizer()),
        ('tfidf', SparkTfidfTransformer())
    ))

    result_local = local_pipeline.fit_transform(X)
    result_dist = dist_pipeline.fit_transform(X_rdd)  # SparseRDD
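
The distributed result above is a ``SparseRDD``, i.e. blocks of SciPy sparse matrices. To compare it against the local result, one option is to pull the blocks to the driver and stack them (a sketch, assuming the result supports ``collect()`` like the RDD it wraps):

.. code:: python

    import scipy.sparse as sp

    # Gather the sparse blocks on the driver and stack them into one matrix
    blocks = result_dist.collect()
    combined = sp.vstack(blocks)

    assert combined.shape == result_local.shape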

Distributed Classifiers
-----------------------

.. code:: python

    import numpy as np

    from splearn.rdd import DictRDD
    from splearn.feature_extraction.text import SparkHashingVectorizer
    from splearn.feature_extraction.text import SparkTfidfTransformer
    from splearn.svm import SparkLinearSVC
    from splearn.pipeline import SparkPipeline

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import Pipeline

    X = [...]  # list of texts
    y = [...]  # list of labels
    X_rdd = sc.parallelize(X, 4)
    y_rdd = sc.parallelize(y, 4)
    Z = DictRDD((X_rdd, y_rdd),
                columns=('X', 'y'),
                dtype=[np.ndarray, np.ndarray])

    local_pipeline = Pipeline((
        ('vect', HashingVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', LinearSVC())
    ))
    dist_pipeline = SparkPipeline((
        ('vect', SparkHashingVectorizer()),
        ('tfidf', SparkTfidfTransformer()),
        ('clf', SparkLinearSVC())
    ))

    local_pipeline.fit(X, y)
    # the full set of classes is passed up front, since a single block
    # may not contain every label
    dist_pipeline.fit(Z, clf__classes=np.unique(y))

    y_pred_local = local_pipeline.predict(X)
    y_pred_dist = dist_pipeline.predict(Z[:, 'X'])  # selects the 'X' column

Distributed Model Selection
---------------------------

.. code:: python

    import numpy as np

    from splearn.rdd import DictRDD
    from splearn.grid_search import SparkGridSearchCV
    from splearn.naive_bayes import SparkMultinomialNB

    from sklearn.grid_search import GridSearchCV
    from sklearn.naive_bayes import MultinomialNB

    X = [...]
    y = [...]
    X_rdd = sc.parallelize(X, 4)
    y_rdd = sc.parallelize(y, 4)
    Z = DictRDD((X_rdd, y_rdd),
                columns=('X', 'y'),
                dtype=[np.ndarray, np.ndarray])

    parameters = {'alpha': [0.1, 1, 10]}
    fit_params = {'classes': np.unique(y)}

    local_estimator = MultinomialNB()
    local_grid = GridSearchCV(estimator=local_estimator,
                              param_grid=parameters)

    estimator = SparkMultinomialNB()
    grid = SparkGridSearchCV(estimator=estimator,
                             param_grid=parameters,
                             fit_params=fit_params)

    local_grid.fit(X, y)
    grid.fit(Z)
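
After fitting, the search results can be inspected as in sklearn (a sketch, assuming ``SparkGridSearchCV`` keeps the usual ``GridSearchCV`` attributes):

.. code:: python

    # Hypothetical attribute access, mirroring sklearn's GridSearchCV
    print(grid.best_params_)
    print(grid.best_score_)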

ROADMAP
=======

- [ ] Transparent API to support plain numpy and scipy objects (partially done in the transparent_api branch)
- [ ] Update all dependencies
- [ ] Use the MLlib and ML packages more extensively (as they become more mature)
- [ ] Support Spark DataFrames

Special thanks
==============

- scikit-learn community
- spylearn community
- PySpark community

Similar Projects
================

- `Thunder <https://github.com/thunder-project/thunder>`_
- `Bolt <https://github.com/bolt-project/bolt>`_

.. |Build Status| image:: https://travis-ci.org/lensacom/sparkit-learn.png?branch=master
   :target: https://travis-ci.org/lensacom/sparkit-learn
.. |PyPi| image:: https://img.shields.io/pypi/v/sparkit-learn.svg
   :target: https://pypi.python.org/pypi/sparkit-learn
.. |Gitter| image:: https://badges.gitter.im/Join%20Chat.svg
   :alt: Join the chat at https://gitter.im/lensacom/sparkit-learn
   :target: https://gitter.im/lensacom/sparkit-learn?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge
.. |Gitential| image:: https://api.gitential.com/accounts/6/projects/75/badges/coding-hours.svg
   :alt: Gitential Coding Hours
   :target: https://gitential.com/accounts/6/projects/75/share?uuid=095e15c5-46b9-4534-a1d4-3b0bf1f33100&utm_source=shield&utm_medium=shield&utm_campaign=75