pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Inconsistency among write_ipc, sink_ipc, and scan_ipc #11581

Open hleumas opened 1 year ago

hleumas commented 1 year ago

Reproducible example

import polars as pl

pl.DataFrame({'values': [0, 1, 2]}).lazy().sink_ipc('example.out')
pl.scan_ipc('example.out').collect()

Log output

Could not mmap compressed IPC file, defaulting to normal read. Toggle off 'memory_map' to silence this warning.

Issue description

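Reading the documentation, one learns that write_ipc writes uncompressed files by default, sink_ipc writes compressed files by default, and scan_ipc memory-maps by default, which only works for uncompressed files.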

This leads to inconsistent behaviour: things suddenly break when one replaces write_ipc with its lazy counterpart, sink_ipc. Moreover, it is confusing that the default behaviour of scan_ipc isn't compatible with the default behaviour of sink_ipc.

Expected behavior

At minimum, sink_ipc followed by scan_ipc should not emit warnings. This can be achieved either by disabling default memory mapping in scan_ipc or by changing default compression to uncompressed.

Ideally, sync and lazy versions should follow the same defaults.

Installed versions

--------Version info---------
Polars: 0.19.3
Index type: UInt32
Platform: macOS-13.6-arm64-arm-64bit
Python: 3.11.4 (main, Jun 20 2023, 17:23:00) [Clang 14.0.3 (clang-1403.0.22.14.1)]
----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
matplotlib:
numpy: 1.25.2
pandas: 2.1.1
pyarrow: 13.0.0
pydantic:
sqlalchemy:
xlsx2csv:
xlsxwriter:

howsiyu commented 1 year ago

sink_ipc doesn't even have the option to set compression to uncompressed. I wonder what's the reason?

mutecamel commented 10 months ago

I hope to be able to sink_ipc uncompressed so that I can later scan_ipc memory-mapped.

dankal444 commented 6 months ago

I also struggle to sink IPC uncompressed for later mmap use. I have a large amount of data that does not fit in RAM.

The only option seems to be lazy_df.collect().write_ipc(), but my data is too large for that. This undermines the whole concept of the Lazy API.