tdebatty / spark-knn-graphs

Spark algorithms for building k-nn graphs
MIT License

OutOfMemoryError: GC overhead limit exceeded #15

Closed: fvictorio closed this issue 6 years ago

fvictorio commented 6 years ago

Hi, I'm trying to build a k-nn graph from a subset of the EMNIST dataset (10K instances) and I'm getting an OutOfMemoryError:

java.lang.OutOfMemoryError: GC overhead limit exceeded

I put together an MWE that reproduces this; you can find it here: https://github.com/fvictorio/spark-knn-graphs-outofmemory

I tried both locally and on Google Cloud (four workers with 15GB of RAM each).

It's very likely that I'm doing something wrong, since I read in another issue that you tested the library on a dataset with millions of rows. But maybe the large number of dimensions is causing trouble?

Thanks.

fvictorio commented 6 years ago

I think (but I'm not 100% sure) that the problem was that Spark's CSV reader creates too few partitions. Adding a .repartition call after reading the dataset fixed the problem.
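
For anyone who hits the same error, here is a rough sketch of how a partition count for that .repartition call might be chosen. It is not code from the MWE or from the library. The class and method names are made up for illustration, the number of cores per worker is an assumption (the thread only mentions four workers), and the "few tasks per core" figure is the general rule of thumb from the Spark tuning guide, not something verified against this dataset.

```java
// Sizing sketch only. PartitionSizing and its methods are hypothetical helpers,
// and the core count used in main() is assumed, not taken from the thread.
final class PartitionSizing {

    /** Rule of thumb: aim for a few tasks (partitions) per CPU core in the cluster. */
    static int targetPartitions(int totalCores, int tasksPerCore) {
        if (totalCores <= 0 || tasksPerCore <= 0) {
            throw new IllegalArgumentException("totalCores and tasksPerCore must be positive");
        }
        return totalCores * tasksPerCore;
    }

    /** Rows per partition if the rows are spread evenly, rounded up. */
    static long rowsPerPartition(long rows, int partitions) {
        if (rows < 0 || partitions <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and partitions must be positive");
        }
        return (rows + partitions - 1) / partitions;
    }

    public static void main(String[] args) {
        int workers = 4;         // from the thread
        int coresPerWorker = 4;  // assumption: not stated in the thread
        int partitions = targetPartitions(workers * coresPerWorker, 3);

        System.out.println("target partitions: " + partitions);
        System.out.println("rows per partition: " + rowsPerPartition(10_000L, partitions));

        // With Spark on the classpath, the count would then be applied right
        // after reading, along the lines of:
        //   Dataset<Row> data = spark.read().csv(path).repartition(partitions);
    }
}
```

If the CSV reader really did put the whole file into one or two partitions, spreading the 10K rows over more partitions lets more executors share the work and keeps each task's memory footprint smaller, which would be consistent with the fix described above.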