lensacom / sparkit-learn

PySpark + Scikit-learn = Sparkit-learn
Apache License 2.0

Stop using Parallel for SparkFeatureUnion #69

Open taynaud opened 8 years ago

taynaud commented 8 years ago

See https://issues.apache.org/jira/browse/SPARK-12717. The parameter is still there for the converted to_scikit() object.

I think it explains the flaky test on my previous PR.

fulibacsi commented 8 years ago

Is this issue still present in Spark 2.0.0?

taynaud commented 8 years ago

I do not know. The issue appears randomly and I have not reproduced it on my cluster. I have added Spark 2.0 to CI in #71, but since the failure is random, I do not know whether that will let us conclude anything.

I think this parallelization is not very useful for a Spark computation.

kszucs commented 8 years ago

Without threading, the pipeline steps will be executed sequentially. I think n_jobs makes sense: multiple DAGs will be submitted and executed in parallel, so the overall level of parallelization can be increased via n_jobs.
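To illustrate the point about threads, here is a minimal sketch in plain Python. It assumes standard-library threads as a stand-in for what Parallel does, and `fit_transformer` and `fit_all` are hypothetical helpers, not sparkit-learn functions. The sleep only mimics a driver thread waiting on a submitted Spark job.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fit_transformer(name, delay=0.2):
    """Hypothetical stand-in for fitting one transformer.

    The sleep mimics a driver thread blocking while a Spark job runs.
    """
    time.sleep(delay)
    return name + "_fitted"


def fit_all(names, n_jobs=1):
    """Fit every transformer, using up to n_jobs driver threads."""
    if n_jobs == 1:
        # Sequential: each "job" starts only after the previous one ends.
        return [fit_transformer(n) for n in names]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Threaded: several "jobs" are in flight at once.
        # pool.map returns results in input order.
        return list(pool.map(fit_transformer, names))


names = ["tfidf", "counts", "lengths"]

start = time.perf_counter()
sequential = fit_all(names, n_jobs=1)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
threaded = fit_all(names, n_jobs=3)
threaded_time = time.perf_counter() - start

print(sequential == threaded, threaded_time < sequential_time)
```

The results are the same either way; only the wall-clock time differs, because the threaded version overlaps the waiting. Whether that overlap is safe with a shared SparkContext is exactly what SPARK-12717 puts in question.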

Shouldn't we drop support for Spark versions before 2.0.0?

taynaud commented 7 years ago

According to the Apache JIRA, it is still an issue in PySpark 2.0.2.