obedaeg / iceberg-duckdb-superset

This repository is a POC using Apache Iceberg, DuckDB, and Superset.

java.lang.OutOfMemoryError: Java heap space, when loading the data #2

Open teemuniiranen opened 2 months ago

teemuniiranen commented 2 months ago

I get the following exception in the "df.groupby('partition_id').count().show()" cell when running the Loading Data notebook:

java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:80)

I tried to increase the memory by adding new config options to the SparkSession builder:

from pyspark.sql import SparkSession

# Playlist JSON files loaded later in the notebook
files_path = "/home/iceberg/playlist_data/*.json"

# Attempt to raise driver and executor memory via the builder
spark = (SparkSession
         .builder
         .appName("IcebergDemo")
         .config("spark.driver.memory", "6g")
         .config("spark.executor.memory", "6g")
         .getOrCreate()
        )

But it seems to reuse the existing Spark session, so these settings do not take effect. There is also the following warning: "WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect."
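For reference, a minimal sketch (assuming the same notebook session) of how to check what is actually in effect: spark.driver.memory is read when the driver JVM starts, so the builder values above are ignored, and only runtime SQL configurations can still be changed on a live session.

from pyspark.sql import SparkSession

# Reuses the notebook's already-running session (hence the warning above)
spark = SparkSession.builder.getOrCreate()

# Shows the driver memory fixed at JVM launch, not the 6g requested above
print(spark.sparkContext.getConf().get("spark.driver.memory", "not set"))

# Runtime SQL configurations are the only settings that still take effect
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))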

Do you have any ideas on how to increase the memory allocation? I can see that the spark-iceberg container is using under 2 GiB at the moment (docker container stats).

teemuniiranen commented 2 months ago

I found a solution: I added a new configuration line to /opt/spark/conf/spark-defaults.conf inside the spark-iceberg container. In my case 4 GB was enough:

spark.driver.memory 4g
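For completeness, a minimal sketch to verify the setting from the notebook, assuming the spark-iceberg container has been restarted so that spark-defaults.conf is re-read when the driver JVM launches:

from pyspark.sql import SparkSession

# A fresh session picks up /opt/spark/conf/spark-defaults.conf at JVM launch
spark = SparkSession.builder.appName("IcebergDemo").getOrCreate()

# Should now report "4g" if the new line was picked up
print(spark.sparkContext.getConf().get("spark.driver.memory", "not set"))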