pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Getting `.columns` of LazyFrame created by `scan_parquet` is slow and uses a lot of memory #16022

Open CangyuanLi opened 2 weeks ago

CangyuanLi commented 2 weeks ago

Checks

Reproducible example

Unfortunately, the data cannot be shared. However, the file is fairly large: 3,604,869 rows and 3,508 columns, a mix of Int64 and String. When I try to view the columns of the LazyFrame like below

pl.scan_parquet(file).columns

I get:

CPU times: user 13.5 s, sys: 5.62 s, total: 19.1 s
Wall time: 19.2 s

Peak memory usage: ~14GB
Final memory usage: ~1GB
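For comparison, what I would expect .columns to cost is roughly a metadata-only read. A minimal sketch of that (assuming pl.read_parquet_schema is the appropriate call here; this is not part of the timing above):

```python
import polars as pl

# Sketch: read only the parquet schema (footer metadata) to get the
# column names, without going through a LazyFrame. `file` is the same
# parquet path as above.
schema = pl.read_parquet_schema(file)
columns = list(schema.keys())
```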

Log output

No response

Issue description

Curiously, on another similarly sized parquet file (3,759,545 rows and 3,002 columns), I get

CPU times: user 49.3 ms, sys: 28.8 ms, total: 78.1 ms
Wall time: 81.1 ms

and negligible (maybe 100-200 MB) memory usage, which is more in line with what I was expecting. As far as I know, the slow parquet file was written out using sink_parquet with default options, so I don't think it differs from the "normal" parquet file in compression algorithm or ratio.

I also compared fetching the first 500 rows against DuckDB, since I am not sure that getting the columns is a 1:1 comparison (I believe DuckDB reads only the parquet metadata; I am not entirely sure whether Polars does the same).

pl.scan_parquet(f).fetch(500)

yields

CPU times: user 32.9 s, sys: 13.8 s, total: 46.7 s
Wall time: 47.1 s

Peak memory usage: ~25GB
Final memory usage: ~1.5GB

while

duckdb.sql("SELECT * FROM read_parquet('file.parquet') LIMIT 500")

yields

CPU times: user 14.5 s, sys: 6.23 s, total: 20.7 s
Wall time: 20.6 s

Peak memory usage: ~4.5GB
Final memory usage: ~1.5GB

Expected behavior

I would expect running .columns on a LazyFrame to return almost instantly, since only the parquet schema/metadata should need to be read.

Installed versions

```
--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             Linux-4.18.0-372.71.1.el8_6.x86_64-x86_64-with-glibc2.28
Python:               3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:          2.1.0
connectorx:
deltalake:
fastexcel:
fsspec:               2021.10.1
gevent:
hvplot:
matplotlib:           3.8.3
nest_asyncio:         1.5.5
numpy:                1.23.5
openpyxl:             3.0.10
pandas:               2.2.1
pyarrow:              15.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:           1.4.39
xlsx2csv:
xlsxwriter:
```
CangyuanLi commented 2 weeks ago

I think this is related to #13092, where sink_parquet creates a large number of small row groups. Looking at the good file, the metadata is

<pyarrow._parquet.FileMetaData object at 0x7f80fc3c1cb0>
  created_by: Polars
  num_columns: 3002
  num_rows: 3759545
  num_row_groups: 14
  format_version: 2.6
  serialized_size: 3470050
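(For reference, the metadata dump above can be produced with pyarrow along these lines; the path is a placeholder:)

```python
import pyarrow.parquet as pq

# Read only the parquet footer metadata; "good_file.parquet" stands in
# for the actual file path.
meta = pq.read_metadata("good_file.parquet")
print(meta)                 # FileMetaData summary like the one above
print(meta.num_row_groups)  # 14 for the "good" file
```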

After I run

pl.scan_parquet(f).sink_parquet("test.parquet")

the metadata becomes

<pyarrow._parquet.FileMetaData object at 0x7f02ec2f2e30>
  created_by: Polars
  num_columns: 3002
  num_rows: 3759545
  num_row_groups: 3752
  format_version: 2.6
  serialized_size: 1048446714

and I see the poor performance and memory usage pop up.
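A possible workaround I have not fully verified: collect the file eagerly and rewrite it with fewer, larger row groups, then re-check the metadata and the .columns timing. Roughly (paths and row_group_size are illustrative):

```python
import polars as pl

# Sketch of an unverified workaround: rewrite with larger row groups so
# the footer metadata stays small. Paths and row_group_size are made up.
df = pl.scan_parquet("test.parquet").collect()
df.write_parquet("test_fixed.parquet", row_group_size=512_000)
```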