I think (but I'm not 100% sure) that the problem was that Spark's CSV reader uses too few partitions. Adding a `.repartition` call after reading the dataset fixed the problem.
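For reference, a minimal sketch of that fix in Scala. This is illustrative, not the exact MWE: it builds a small in-memory DataFrame instead of reading the real EMNIST CSV, and the partition count of 64 is an arbitrary example you would tune to your cluster.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("repartition-sketch")
  .master("local[*]") // assumption: local run, just for illustration
  .getOrCreate()

import spark.implicits._

// Stand-in for the real dataset; with the actual file you would use
// spark.read.csv(...) here instead.
val df = (1 to 1000).map(i => (i, i * 2)).toDF("id", "value")

// Spark's CSV reader can produce very few partitions for a modestly
// sized file, which concentrates the expensive k-NN graph construction
// on too few tasks. An explicit repartition spreads the rows out first.
val repartitioned = df.repartition(64) // 64 is an example; tune to your cluster

println(repartitioned.rdd.getNumPartitions) // 64
```

With the real dataset, the same `.repartition(n)` goes immediately after the `spark.read.csv(...)` call, before handing the data to the graph builder.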
Hi, I'm trying to make a graph of a subset of the EMNIST dataset (10K instances) and I'm getting an `OutOfMemoryError`. I made a MWE to reproduce this; you can find it here: https://github.com/fvictorio/spark-knn-graphs-outofmemory
I tried both locally and in Google Cloud (with four workers with 15 GB of RAM each).
It's very likely that I'm doing something wrong, since I read in another issue that you tested the library with a dataset of millions of rows. But maybe the large number of dimensions is causing trouble?
Thanks.