joblib / joblib-spark

Joblib Apache Spark Backend
Apache License 2.0

Parallelisation not working for BIRCH clustering #42

Open siddharth-redseer opened 1 year ago

siddharth-redseer commented 1 year ago

I have used the syntax provided in the docs:

from sklearn.cluster import Birch
from joblib import parallel_backend  # 'spark' backend registered via joblibspark.register_spark()

clusterer = Birch()
with parallel_backend('spark', n_jobs=100):
    clusterer.fit(df.toPandas())  # df is a PySpark DataFrame

The Spark UI does not register it as a job and no executors get deployed. However, the example provided in the docs does get registered as a Spark job.

Error - "Unable to allocate 920. GiB for an array with shape (123506239506,) and data type float64"

[Screenshot attached: 2022-08-27, 12:33:08 AM]
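Editor's note: the reported allocation is consistent with the array shape in the error message. This is plain arithmetic, not code from the thread: 123,506,239,506 float64 values is about 920 GiB, and that element count is exactly C(497004, 2), which would fit a condensed pairwise-distance array over roughly 497k samples (an inference, not something stated in the issue).

n_elements = 123_506_239_506
print(n_elements * 8 / 2**30)   # ~920.2 GiB of float64, matching "Unable to allocate 920. GiB"
print(497_004 * 497_003 // 2)   # 123506239506 == C(497004, 2), i.e. all sample pairs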
WeichenXu123 commented 1 year ago

This looks like it ran out of memory; it should not be an issue with joblib-spark.
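Editor's note: a likely reason the fit never appears in the Spark UI (my reading, not confirmed in the thread) is that joblib-spark only distributes work that scikit-learn dispatches through joblib.Parallel. Birch.fit has no n_jobs parameter and runs entirely on the driver, so the parallel_backend context has nothing to ship to executors. A workload that does go through joblib, roughly in the spirit of the README example, is sketched below; the iris/SVC estimator and n_jobs=3 are illustrative assumptions, not taken from this issue.

from joblibspark import register_spark
from joblib import parallel_backend
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

register_spark()  # register the 'spark' backend with joblib

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, X, y, cv=5)  # the CV fits are dispatched as Spark tasks
print(scores)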