lensacom / sparkit-learn

PySpark + Scikit-learn = Sparkit-learn
Apache License 2.0

Poor performance #77

Closed orfi2017 closed 7 years ago

orfi2017 commented 7 years ago

Hi all! I have just started to dive into Spark machine learning, coming from scikit-learn. I tried to fit a linear SVC with both scikit-learn and sparkit-learn, and splearn remains slower than scikit. How is this possible? (I am attaching my code and results.)

```python
import time as t
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from splearn.svm import SparkLinearSVC
from splearn.rdd import ArrayRDD, DictRDD
import numpy as np

X, y = make_classification(n_samples=20000, n_classes=2)
print 'Dataset created. # of samples: ', X.shape[0]

skstart = t.time()
dt = DecisionTreeClassifier()
local_clf = LinearSVC()
local_clf.fit(X, y)
sktime = t.time() - skstart
print 'Scikit-learn fitting time: ', sktime

spstart = t.time()
X_rdd = sc.parallelize(X, 20)
y_rdd = sc.parallelize(y, 20)
Z = DictRDD((X_rdd, y_rdd), columns=('X', 'y'), dtype=[np.ndarray, np.ndarray])

distr_clf = SparkLinearSVC()
distr_clf.fit(Z, np.unique(y))
sptime = t.time() - spstart
print 'Spark time: ', sptime
```

```
============== RESULTS =================
Dataset created. # of samples:  20000
Scikit-learn fitting time:  3.03552293777
Spark time:  3.919039011
```

Or for fewer samples:

```
Dataset created. # of samples:  2000
Scikit-learn fitting time:  0.244801998138
Spark time:  3.15833210945
```

kszucs commented 7 years ago

If you have a dataset that fits in memory (a dataset with 20000 samples is small), sklearn/numpy/pandas will always be faster. Here is a good overview: http://dask.pydata.org/en/latest/spark.html
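To make that fixed cost visible, here is a rough editorial sketch (not from the original comment) that times only the data-distribution step, assuming the same `sc` SparkContext used above; on small data this overhead alone tends to dominate the whole run:

```python
# Rough illustration: measure just the cost of shipping a small in-memory
# array into an RDD and forcing its evaluation, before any training happens.
import time as t
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_classes=2)

start = t.time()
X_rdd = sc.parallelize(X, 20)   # `sc` is the existing SparkContext
X_rdd.count()                   # force evaluation of the RDD
print 'RDD distribution overhead: ', t.time() - start
```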

BTW, we're moving away from Spark. I suggest you use dask instead.
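For reference, a minimal dask-based sketch of the same kind of fit, assuming dask and dask-ml are installed (the names and chunk sizes here are illustrative, not part of this thread): `Incremental` streams the chunks of a dask array through any estimator with `partial_fit`, and `SGDClassifier(loss='hinge')` gives a linear SVM trained incrementally.

```python
# A minimal sketch, not a definitive implementation: linear SVM on a dask
# array via dask-ml's Incremental wrapper.
import numpy as np
import dask.array as da
from dask_ml.wrappers import Incremental
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=20000, n_classes=2)
X_da = da.from_array(X, chunks=(2000, X.shape[1]))
y_da = da.from_array(y, chunks=2000)

# Incremental feeds each chunk to SGDClassifier.partial_fit in turn.
clf = Incremental(SGDClassifier(loss='hinge'))
clf.fit(X_da, y_da, classes=np.unique(y))
```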

orfi2017 commented 7 years ago

OK, I got it. You are right: for larger datasets some runs (though not all of them) performed much better. Thanks a lot for the clarification.

kszucs commented 7 years ago

You are welcome! I will close this issue then.