pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.47k stars 1.87k forks source link

`sink_parquet` doesn't work on LazyFrames created by `scan_ndjson` #10964

Open sd2k opened 1 year ago

sd2k commented 1 year ago

Problem description

The following code gives a (slightly confusing) error:

pl.scan_ndjson(input_file).sink_parquet(output_file)
# thread '<unnamed>' panicked at /Users/ben/repos/rust/polars/crates/polars-lazy/src/physical_plan/planner/lp.rs:153:28:
# sink_parquet not yet supported in standard engine. Use 'collect().write_parquet()'
# note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This diff has an xfailing test, as well as a test showing that the same thing works fine for scan_csv.

I'd like to try and fix this but don't really know why sink_parquet isn't supported for a LazyFrame read using scan_ndjson 🤔

ritchie46 commented 1 year ago

Because we don't have a streaming ndjson reader yet. PR for this would be very welcome.

sd2k commented 1 year ago

I'd love to contribute this. Currently trying to understand what's different between CSV/JSON. https://github.com/pola-rs/polars/pull/4382 added a LazyJsonReader and scan_ndjson which seems like a lazy/streaming ndjson reader, but clearly I'm misunderstanding somewhere. It looks like a new source might need adding to polars-pipe?

Putnam14 commented 8 months ago

If there is no streaming ndjson reader, does scan_ndjson have a purpose in its current state? I ran into this when trying to convert a large ndjson file to parquet too.

brbickel commented 1 month ago

If there is no streaming ndjson reader, does scan_ndjson have a purpose in its current state? I ran into this when trying to convert a large ndjson file to parquet too.

From the docs: allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.

So unfortunately this doesn't help when trying to do the scan -> sink full file conversion.