Open: learningkeeda opened this issue 1 month ago
> Are you sure you are not going OOM because you load into memory?
I also noticed increased memory usage depending on how I read the data into my DataFrame.
# Imports (other setup, e.g. the definition of val1, elided)
import polars as pl
from polars.testing import assert_frame_equal
from pyiceberg.catalog import load_catalog
parquet_with_hive = pl.read_parquet("s3://mybucket/mytable/col1=val1/myfile.parquet", hive_partitioning=True)
parquet_without_hive = pl.read_parquet("s3://mybucket/mytable/col1=val1/myfile.parquet", hive_partitioning=False)
iceberg_table = load_catalog("default").load_table("mytable")
iceberg = pl.scan_iceberg(iceberg_table)
iceberg = iceberg.filter(pl.col("col1") == val1).collect() # Only collect data from this one parquet file, not from other partitions
# Check that the data is indeed the same
assert_frame_equal(iceberg, parquet_with_hive)
assert_frame_equal(iceberg, parquet_without_hive)
print(parquet_with_hive.estimated_size("mb"))
print(parquet_without_hive.estimated_size("mb"))
print(iceberg.estimated_size("mb"))
With one of my test data sets it leads to the following output:
180.23824501037598
969.8529415130615
1467.6535663604736
I expected the estimated_size to be the same, since all DataFrames hold the same data. We would like to use Iceberg, but it seems that we will have to read the data using read_parquet + hive_partitioning=True instead.
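A possible workaround, as a minimal sketch rather than an API guarantee, and assuming the low-cardinality columns are plain String/Utf8 columns: cast them to Categorical after collecting, which gives them a dictionary-encoded in-memory representation.

# 'iceberg' is the DataFrame collected in the snippet above
compact = iceberg.with_columns(pl.col(pl.Utf8).cast(pl.Categorical))
print(compact.estimated_size("mb"))  # typically much smaller when string values repeat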
> Are you sure you are not going OOM because you load into memory?
Iceberg tables with String columns of low cardinality are not loaded into memory efficiently, and as a result we are hitting OOM. With optimizations such as dictionary encoding in place, we could extend Polars usage to much larger datasets.
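To confirm which columns are responsible, a small sketch along these lines (df stands for any collected DataFrame, e.g. the result of pl.scan_iceberg(...).collect()) prints the per-column size and cardinality:

# Inspect how much memory each column uses and how many distinct values it has
for name in df.columns:
    s = df.get_column(name)
    print(name, s.dtype, s.estimated_size("mb"), "MB,", s.n_unique(), "unique values")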
Any update on this? I am having an issue reading a column of about 2 GB of data from an Iceberg table that is 240 GB in total.
Checks
Reproducible example
import polars as pl
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("test.table1")
pl.scan_iceberg(table).collect()
Log output
No response
Issue description
When we tried to load Iceberg tables using scan_iceberg, we found that string column data is kept in memory in its original, uncompressed form, leading to increased memory usage.
Columns with low cardinality take up a disproportionate amount of memory, causing out-of-memory errors for large datasets.
Expected behavior
To avoid these issues, it is best to use dictionary encoding, especially for columns with low cardinality, where values are repeated frequently. Dictionary encoding reduces the in-memory size by replacing frequent values with short integer codes that reference a dictionary of the distinct values; without it, every occurrence is stored as a full, uncompressed string, leading to increased memory usage.
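For illustration only (a self-contained sketch, not the proposed implementation inside scan_iceberg), the effect of dictionary encoding on Polars' in-memory size can be seen by comparing a plain String column with its Categorical counterpart:

import polars as pl

# 9 million rows, but only three distinct string values (low cardinality)
df = pl.DataFrame({"country": ["DE", "FR", "IT"] * 3_000_000})

print(df.estimated_size("mb"))  # plain Utf8 strings
print(df.with_columns(pl.col("country").cast(pl.Categorical)).estimated_size("mb"))  # dictionary-encoded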
Installed versions