joblib / joblib-spark

Joblib Apache Spark Backend
Apache License 2.0

Parallelisation not working for BIRCH clustering #42

Open siddharth-redseer opened 1 year ago

siddharth-redseer commented 1 year ago

I have used the syntax provided in the docs:

from sklearn.cluster import Birch
from joblib import parallel_backend  # 'spark' backend registered via joblibspark.register_spark()

clusterer = Birch()
with parallel_backend('spark', n_jobs=100):
    clusterer.fit(df.toPandas())  # df is a PySpark DataFrame

The Spark UI does not register it as a job and no executors get deployed. However, the example provided in the docs does get registered as a Spark job.

Error - "Unable to allocate 920. GiB for an array with shape (123506239506,) and data type float64"

[Screenshot attached: 2022-08-27, 12:33:08 AM]
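Editor's note: the reported allocation is consistent with the array shape in the error message. This is plain arithmetic, not code from the thread: 123,506,239,506 float64 values is about 920 GiB, and that element count is exactly C(497004, 2), which would fit a condensed pairwise-distance array over roughly 497k samples (an inference, not something stated in the issue).

n_elements = 123_506_239_506
print(n_elements * 8 / 2**30)   # ~920.2 GiB of float64, matching "Unable to allocate 920. GiB"
print(497_004 * 497_003 // 2)   # 123506239506 == C(497004, 2), i.e. all sample pairs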
WeichenXu123 commented 1 year ago

This looks like it ran out of memory; it should not be an issue with joblib-spark.
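Editor's note: a likely reason the fit never appears in the Spark UI (my reading, not confirmed in the thread) is that joblib-spark only distributes work that scikit-learn dispatches through joblib.Parallel. Birch.fit has no n_jobs parameter and runs entirely on the driver, so the parallel_backend context has nothing to ship to executors. A workload that does go through joblib, roughly in the spirit of the README example, is sketched below; the iris/SVC estimator and n_jobs=3 are illustrative assumptions, not taken from this issue.

from joblibspark import register_spark
from joblib import parallel_backend
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

register_spark()  # register the 'spark' backend with joblib

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, X, y, cv=5)  # the CV fits are dispatched as Spark tasks
print(scores)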