lensacom / sparkit-learn

PySpark + Scikit-learn = Sparkit-learn
Apache License 2.0

Spark version 1.2.1, error AttributeError: <class 'rdd.BlockRDD'> object has no attribute treeReduce #37

Closed vishalrajpal25 closed 9 years ago

vishalrajpal25 commented 9 years ago

countvectorizer = SparkCountVectorizer(tokenizer=tokenize_pre_process)
count_vector
<class 'rdd.ArrayRDD'> from PythonRDD[22] at collect at rdd.py:168
sel_vt = SparkVarianceThreshold()
red_vt_vector = sel_vt.fit_transform(count_vector)
Traceback (most recent call last):
  File "", line 1, in
  File "base.py", line 63, in fit_transform
    return self.fit(Z, **fit_params).transform(Z)
  File "featureselection.py", line 72, in fit
    _, _, self.variances_ = X.map(mapper).treeReduce(reducer)
  File "rdd.py", line 179, in __getattr__
    self.__class__, attr))
AttributeError: <class 'rdd.BlockRDD'> object has no attribute treeReduce

I am using Spark 1.2.1, and I thought RDD had the treeReduce method. Do you have any idea why this error is raised from ArrayRDD, which extends BlockRDD?

kszucs commented 9 years ago

I've successfully reproduced the issue. In Spark 1.2, pyspark.RDD doesn't have the treeReduce method; see https://github.com/apache/spark/blob/branch-1.2/python/pyspark/rdd.py

If you can't update Spark, I suggest changing treeReduce to reduce.
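The swap suggested above can also be written as a small fallback, so the same code runs whether or not treeReduce is available. This is only a sketch: `reduce_compat` and the two `Fake...RDD` classes are hypothetical names used for illustration and are not part of sparkit-learn or PySpark. It assumes the reducer is associative and commutative, so both methods give the same result.

```python
from functools import reduce as _reduce


def reduce_compat(rdd, func):
    """Call rdd.treeReduce(func) if the method exists, else rdd.reduce(func).

    Hypothetical helper, not part of sparkit-learn or PySpark.
    """
    tree_reduce = getattr(rdd, "treeReduce", None)
    if callable(tree_reduce):
        return tree_reduce(func)
    return rdd.reduce(func)


class FakeOldRDD(object):
    """Stand-in for an RDD that only offers reduce (as in Spark 1.2)."""

    def __init__(self, items):
        self._items = list(items)

    def reduce(self, func):
        return _reduce(func, self._items)


class FakeNewRDD(FakeOldRDD):
    """Stand-in for an RDD that also offers treeReduce."""

    def treeReduce(self, func, depth=2):
        # A real treeReduce aggregates in several levels; the result is the same.
        return _reduce(func, self._items)


def add(a, b):
    return a + b


old_total = reduce_compat(FakeOldRDD([1, 2, 3, 4]), add)
new_total = reduce_compat(FakeNewRDD([1, 2, 3, 4]), add)
```

With this approach, the failing line in fit would call `reduce_compat(X.map(mapper), reducer)` instead of `X.map(mapper).treeReduce(reducer)`. The lookup is safe on BlockRDD because, as the traceback shows, its attribute lookup raises AttributeError for missing methods, which `getattr` with a default handles.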

Right now, Spark version support is still under discussion. We'll update the requirements in the README.

vishalrajpal25 commented 9 years ago

Thanks for the solution. We can't update Spark yet, so we will change the method to reduce on the RDD.