nlesc-sherlock / emma

Ansible playbook to create a cluster with GlusterFS, Docker, Spark and JupyterHub services
Apache License 2.0

Jupyter notebook often gets Java heap space issues. #52

Closed romulogoncalves closed 7 years ago

romulogoncalves commented 7 years ago

Even for small data files, we often get `java.lang.OutOfMemoryError: Java heap space` exceptions. The problem might be related to the fact that the Spark context is pre-set for JupyterHub and cannot be configured by the user.
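
As a quick check (a minimal sketch, assuming the kernel pre-creates a PySpark `SparkContext` bound to `sc`, as is usual in PySpark notebooks), the memory the pre-set context was given can be inspected from a notebook cell:

```python
# Inspect the pre-configured Spark context from a notebook cell.
# Assumes the PySpark kernel has already bound a SparkContext to `sc`.
conf = sc.getConf()
print(conf.get("spark.driver.memory", "not set"))    # driver heap size
print(conf.get("spark.executor.memory", "not set"))  # executor heap size
```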

romulogoncalves commented 7 years ago

Interesting piece of information extracted from Stack Overflow:

"The memory you need to assign to the driver depends on the job. If the job is based purely on transformations and terminates on some distributed output action like rdd.saveAsTextFile, rdd.saveToCassandra, ... then the memory needs of the driver will be very low. Few 100's of MB will do. The driver is also responsible of delivering files and collecting metrics, but not be involved in data processing. If the job requires the driver to participate in the computation, like e.g. some ML algo that needs to materialize results and broadcast them on the next iteration, then your job becomes dependent of the amount of data passing through the driver. Operations like .collect,.take and takeSample deliver data to the driver and hence, the driver needs enough memory to allocate such data. e.g. If you have an rdd of 3GB in the cluster and call val myresultArray = rdd.collect, then you will need 3GB of memory in the driver to hold that data plus some extra room for the functions mentioned in the first paragraph. "

romulogoncalves commented 7 years ago

We just need to make sure the driver node has enough memory when running Spark ML operations, or at least enough swap space.
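
One way to do that (a sketch, not the configuration emma actually ships; the `4g` values are placeholders to tune to the host's RAM) is to set the driver memory before the driver JVM starts, since `spark.driver.memory` is ignored once the JVM is already running in client mode:

```python
import os

# Must be set before the driver JVM launches; "pyspark-shell" is the
# required trailing token for PYSPARK_SUBMIT_ARGS.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 4g pyspark-shell"

from pyspark import SparkContext
sc = SparkContext()  # picks up the submit args above
```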