edthrn opened 3 months ago
I just tested with a smaller dataset, i.e. instead of scanning all ~17k files, I only scan the first 50...
And it works :thinking:
Does it mean that the problem comes from the data itself (e.g. a null value or something similar)? In that case, it's still odd that the unsorted version works as expected...
I managed to scan/filter/sort/sink the whole dataset by processing it in batches of 500 source files:

import polars as pl
from datetime import datetime, timedelta
from itertools import batched  # Python 3.12+

for i, batch in enumerate(batched(s3_urls, 500)):
    (
        pl.scan_parquet(batch)
        .filter(pl.col("date") < datetime.now() - timedelta(days=120))
        .sort(pl.col("value"))
        .sink_parquet(f"/tmp/data_{i}.parquet")
    )
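Note that `itertools.batched` only exists on Python 3.12+; on older versions, a small fallback with the same behavior can be sketched as:

```python
from itertools import islice

def batched(iterable, n):
    """Yield successive tuples of up to n items each
    (fallback for itertools.batched on Python < 3.12)."""
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch
```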
Hence, the supposition I gave above can be ruled out: it's not a data-value or data-type problem.
Minor problem now: I end up with 34 Parquet files at the end of the process (from 17k source files in total), instead of a single large one.
I'm hitting something similar after upgrading to Polars v1.0.0 (note: I am using polars-u64-idx).
thread '<unnamed>' panicked at crates/polars-core/src/series/series_trait.rs:234:9:
`shrink_to_fit` operation not supported for dtype `decimal[15,2]`
thread 'polars-5' panicked at crates/polars-pipe/src/executors/sinks/io.rs:271:49:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
thread 'polars-4' panicked at crates/polars-pipe/src/executors/sinks/io.rs:271:49:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
thread 'polars-3' panicked at crates/polars-pipe/src/executors/sinks/io.rs:271:49:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
thread 'polars-6' panicked at crates/polars-pipe/src/executors/sinks/io.rs:271:49:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
thread 'polars-8' panicked at crates/polars-pipe/src/executors/sinks/io.rs:271:49:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
thread 'polars-9' panicked at crates/polars-pipe/src/executors/sinks/io.rs:271:49:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
thread 'polars-10' panicked at crates/polars-pipe/src/executors/sinks/io.rs:271:49:
called `Result::unwrap()` on an `Err` value: "SendError(..)"
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
Cell In[7], line 5
      1 df = pl.scan_parquet(data)
      2 (
      3     df.sort(pl.col("l_orderkey"), pl.col("l_partkey"), pl.col("l_suppkey"))
      4     .head(3)
----> 5     .collect(streaming=True)
      6 )

File ~/repos/ibis/venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))
PanicException: called `Result::unwrap()` on an `Err` value: "SendError(..)"
from:

df = pl.scan_parquet(data)
(
    df.sort(pl.col("l_orderkey"), pl.col("l_partkey"), pl.col("l_suppkey"))
    .head(3)
    .collect(streaming=True)
)
where data points to ~275GB of Parquet files.
Interestingly, before upgrading I was hitting #17281 on this operation.
I confirm that I still get the issue after upgrading to v1.0.0.
Checks
Reproducible example
Given a very large data set (1b rows) stored on S3:
This works well:
But this doesn't:
I get the following error:
Log output
Issue description
I stumbled upon #16603 and tried the POLARS_ACTIVATE_DECIMAL=1 hack. It was necessary for the first (unsorted) sample code to work, but it is apparently not sufficient for the sorted code sample to work.
I tested with both versions 0.20.31 and 1.0.0rc2: same results.
EDIT: also tested on 1.0.0 with the same results.
Expected behavior
I expected the lazy scan/filter/sort/sink to work as well as scan/filter/sink.
Installed versions