pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`scan_parquet` + `sink_parquet` with same filename raises exception and truncates file #12843

Open cmdlineluser opened 9 months ago

cmdlineluser commented 9 months ago


Reproducible example

```python
import tempfile

import polars as pl

f = tempfile.NamedTemporaryFile()

df = pl.DataFrame({"A": [1]})
df.write_parquet(f.name)

print(pl.read_parquet(f.name))
# ┌─────┐
# │ A   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# └─────┘

pl.scan_parquet(f.name).sink_parquet(f.name)
# ComputeError: parquet: File out of specification: A parquet file must contain a header and footer with at least 12 bytes
```

Log output

No response

Issue description

Just noticed Polars lets you use the same filename here, but it ends up truncating the file.

Not sure if this is supposed to be allowed, but if not, it should raise and leave the input intact.

Expected behavior

"Work" or raise exception and leave input intact.

Installed versions

```
--------Version info---------
Polars:               0.19.18
Index type:           UInt32
Platform:             macOS-13.6.1-arm64-arm-64bit
Python:               3.11.6 (main, Nov 2 2023, 04:39:40) [Clang 14.0.0 (clang-1400.0.29.202)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:               2023.6.0
gevent:
matplotlib:
numpy:                1.26.2
openpyxl:
pandas:               2.0.3
pyarrow:              12.0.1
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
Putnam14 commented 8 months ago

I just got bit by this too: `pl.scan_parquet(f.name).with_columns(pl.col(pl.Utf8).str.strip_chars()).sink_parquet(f.name)`.

The `seek_len` function feels sketchy at best in an asynchronous context to determine the file size.

On https://github.com/rust-lang/rust/issues/59359 there's mention of using file.metadata().len() to get the size of a file without requiring a mutable borrow, maybe that's a better fit here?

I found an older discussion of this, and it does make sense that you wouldn't be able to write to the same file you're streaming from - it's already open. I do think the error message could be improved here.