ghuls opened 1 year ago
It still does not work by default in 0.19.8:
In [4]: %time pl.scan_csv("test.tsv", separator="\t", has_header=False, comment_char='#').with_columns([pl.col(pl.Utf8).cast(pl.Categorical)]).sink_ipc("test.categorical.ipc")
thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-pipe/src/executors/sinks/file_sink.rs:248:54:
called `Result::unwrap()` on an `Err` value: ArrowError(InvalidArgumentError("Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches."))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-pipe/src/executors/sinks/file_sink.rs:275:43:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-pipe/src/executors/sinks/file_sink.rs:275:43:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-pipe/src/executors/sinks/file_sink.rs:275:43:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
File <timed eval>:1
File ~/software/anaconda3/envs/polars/lib/python3.8/site-packages/polars/lazyframe/frame.py:2057, in LazyFrame.sink_ipc(self, path, compression, maintain_order, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, no_optimization)
2010 """
2011 Evaluate the query in streaming mode and write to an IPC file.
2012
(...)
2046
2047 """
2048 lf = self._set_sink_optimizations(
2049 type_coercion=type_coercion,
2050 predicate_pushdown=predicate_pushdown,
(...)
2054 no_optimization=no_optimization,
2055 )
-> 2057 return lf.sink_ipc(
2058 path=path,
2059 compression=compression,
2060 maintain_order=maintain_order,
2061 )
PanicException: called `Result::unwrap()` on an `Err` value: "SendError(..)"
In [6]: pl.show_versions()
--------Version info---------
Polars: 0.19.8
Index type: UInt32
Platform: Linux-5.15.0-86-generic-x86_64-with-glibc2.10
Python: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:57:06)
[GCC 9.4.0]
----Optional dependencies----
adbc_driver_sqlite: <not installed>
cloudpickle: 2.2.1
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
gevent: <not installed>
matplotlib: <not installed>
numpy: 1.24.4
openpyxl: <not installed>
pandas: 1.5.3
pyarrow: 11.0.0
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
Here's a small Rust example that reproduces the panic:
use polars::prelude::*;

fn main() {
    let df = df![
        "strings" => &["a", "b", "c", "d", "e"],
    ]
    .unwrap();

    df.lazy()
        .with_columns([col("strings").cast(DataType::Categorical(None))])
        .sink_ipc(
            std::path::PathBuf::from("out.arrow"),
            IpcWriterOptions::default(),
        )
        .unwrap()
}

Running it prints the following before panicking:
run UdfExec
RUN STREAMING PIPELINE
df -> hstack -> parquet_sink
RefCell { value: [] }
thread '<unnamed>' panicked at /Users/bytenybbler/.cargo/registry/src/index.crates.io-6f17d22bba15001f/polars-pipe-0.35.4/src/executors/sinks/file_sink.rs:286:54:
called `Result::unwrap()` on an `Err` value: InvalidOperation(ErrString("Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches."))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at /Users/bytenybbler/.cargo/registry/src/index.crates.io-6f17d22bba15001f/polars-pipe-0.35.4/src/executors/sinks/file_sink.rs:336:14:
called `Result::unwrap()` on an `Err` value: Any { .. }
On the other hand, performing the categorical cast prior to the streaming operation does not cause a panic:
use polars::prelude::*;

fn main() {
    let categoricals = Series::new("strings", ["a", "b", "c", "d", "e"])
        .cast(&DataType::Categorical(None))
        .unwrap();
    let df = DataFrame::new(vec![categoricals]).unwrap();

    df.lazy()
        .sink_ipc(
            std::path::PathBuf::from("out.arrow"),
            IpcWriterOptions::default(),
        )
        .unwrap()
}
Seems like DictionaryTracker::insert in crates/polars-arrow/src/io/ipc/write/common.rs appears to create a new dictionary whenever it encounters string values in a column that differ from those it has already seen when casting to categorical. Since the IPC implementation only allows one dictionary per column, Polars does not like this.
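In the meantime, a possible Python-side workaround is to mirror the eager Rust example above: materialize the cast before writing, so the categorical column is built as a whole and the eager IPC writer emits a single dictionary. This is only a sketch (not verified against 0.19.8) and gives up the memory benefits of sink_ipc; the file names are taken from the original report, and the explicit StringCache is a precaution rather than a confirmed requirement:

import polars as pl

# Sketch of a workaround: collect eagerly, then write with the eager
# IPC writer instead of the streaming sink. Whether the StringCache is
# strictly needed here is an assumption.
with pl.StringCache():
    df = (
        pl.scan_csv("test.tsv", separator="\t", has_header=False, comment_char="#")
        .with_columns(pl.col(pl.Utf8).cast(pl.Categorical))
        .collect()
    )
    df.write_ipc("test.categorical.ipc")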
Problem description
Write only one dictionary when sinking to IPC.
It would be great if, when writing categorical data to an IPC sink, only one unified dictionary were written instead of multiple ones, as the IPC format does not support more than one.
Since Polars internally would still be adding new values to the string cache, it shouldn't be a problem to write the full dictionary at the end: values that are not seen in the latest batch are not forgotten, thanks to the string cache (unless the IPC format physically requires a dictionary to be located close to its batch).
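To illustrate the point about the string cache, a minimal sketch using only public Python API: under an active StringCache, categorical columns cast in separate batches draw their physical codes from one global mapping, which is exactly what a unified end-of-sink dictionary would rely on:

import polars as pl

with pl.StringCache():
    # Two batches cast separately still share the one global cache,
    # so values seen in earlier batches are never forgotten.
    batch1 = pl.Series(["a", "b"]).cast(pl.Categorical)
    batch2 = pl.Series(["b", "c"]).cast(pl.Categorical)
    print(batch1.to_physical())  # e.g. [0, 1]
    print(batch2.to_physical())  # e.g. [1, 2] -- "b" keeps code 1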
Not sure if it should be handled automatically or not, but reading Parquet/IPC files with multiple categorical columns currently requires enabling the StringCache manually.
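For reference, that manual pattern looks roughly like this (a sketch; the file names are placeholders):

import polars as pl

# Enabling the string cache around the reads makes the categorical
# columns from both files compatible (e.g. for joins or, assuming
# matching schemas, concatenation).
with pl.StringCache():
    df1 = pl.read_ipc("first.ipc")
    df2 = pl.read_ipc("second.ipc")
    combined = pl.concat([df1, df2])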