Closed: ddutt closed this issue 3 weeks ago
@nameexhaustion Can you take a look at this?
Thanks for the report! We'd like to know some more details to help us debug. With the example you've given, I've tried it on both the latest version as well as 0.20.19, and on both versions it gave an error. Would you be able to provide a reproducible example that runs successfully on the older version you upgraded from, but fails on the latest versions? It would also help to know the exact error message you saw (e.g. FileNotFoundError: No such file or directory (os error 2): non_existent.csv).
I get the following error:
In [1]: pl.scan_parquet('/tmp/foo/**/*.parquet', hive_partitioning=True, rechunk=False).collect()
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
Cell In[1], line 1
----> 1 pl.scan_parquet('/tmp/foo/**/*.parquet', hive_partitioning=True, rechunk=False).collect()
File ~/work/stardust/enterprise/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:2027, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
2025 # Only for testing purposes
2026 callback = _kwargs.get("post_opt_callback", callback)
-> 2027 return wrap_df(ldf.collect(callback))
ComputeError: expected at least 1 path
In [2]:
I'm surprised it worked in 0.20.x. I only noticed this when our tests started failing because we had accidentally created the conditions that triggered it, so it's possible it failed in 0.20.19 as well.
In 0.20.19, I get this error:
In [1]: pl.scan_parquet('/tmp/foo/**/*.parquet', hive_partitioning=True, rechunk=False).collect()
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 pl.scan_parquet('/tmp/foo/**/*.parquet', hive_partitioning=True, rechunk=False).collect()
File ~/work/dsengines/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py:1967, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
1964 # Only for testing purposes atm.
1965 callback = _kwargs.get("post_opt_callback")
-> 1967 return wrap_df(ldf.collect(callback))
FileNotFoundError: No such file or directory (os error 2): /tmp/foo/**/*.parquet
I see now why our code worked with this: it explicitly caught this error.
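That defensive pattern can also be written without depending on which exception type a given polars release raises, by resolving the glob up front. A minimal stdlib-only sketch (`paths_for_scan` is a hypothetical helper, not part of the polars API):

```python
import glob
import os
import tempfile

def paths_for_scan(pattern: str) -> list[str]:
    """Hypothetical pre-check: expand the glob ourselves so an empty
    match can be handled explicitly, instead of catching whatever
    exception the installed polars version happens to raise."""
    return glob.glob(pattern, recursive=True)

# Empty hive-partitioned hierarchy: one partition dir, zero files.
root = os.path.join(tempfile.mkdtemp(), "foo")
os.makedirs(os.path.join(root, "bar=bat"))

matches = paths_for_scan(os.path.join(root, "**", "*.parquet"))
print(len(matches))  # 0 -> skip the scan / substitute an empty frame
```

If the list is empty, the caller can skip the scan or substitute an empty DataFrame before any `collect()` runs.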
On the performance: pyarrow's dataset scanning doesn't do globbing, whereas with polars scan_parquet you pass a recursive glob, so we must traverse all subdirectories. I'd expect this to be much more expensive with many subfolders. Try scan_parquet with hive partitioning, where you only pass the directory.
I didn't know I could do that now. Will try. How does scan_pyarrow_dataset know that there's no data if it doesn't recurse?
scan_parquet without the glob on a completely empty folder hierarchy suffers a PanicException:
In [2]: pl.scan_parquet('/tmp/foo/', rechunk=False, hive_partitioning=True).collect_schema()
thread '<unnamed>' panicked at crates/polars-io/src/path_utils/mod.rs:342:28:
index out of bounds: the len is 0 but the index is 0
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
Cell In[2], line 1
----> 1 pl.scan_parquet('/tmp/foo/', rechunk=False, hive_partitioning=True).collect_schema()
File ~/work/stardust/enterprise/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:2231, in LazyFrame.collect_schema(self)
2201 def collect_schema(self) -> Schema:
2202 """
2203 Resolve the schema of this LazyFrame.
2204
(...)
2229 3
2230 """
-> 2231 return Schema(self._ldf.collect_schema())
PanicException: index out of bounds: the len is 0 but the index is 0
In [3]:
The folder hierarchy is the same as in the example provided:
➜ tree /tmp/foo
/tmp/foo
└── bar=bat
1 directory, 0 files
And:
In [1]: from pyarrow.dataset import dataset
In [2]: ds = dataset('/tmp/foo/')
In [3]: pl.scan_pyarrow_dataset(ds).collect_schema()
Out[3]: Schema()
In [4]:
I will fix the PanicException in https://github.com/pola-rs/polars/pull/18620
Checks
Reproducible example
First create an empty hive-partitioned dir:
Then execute:
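The commands themselves are elided in this extract; from the tracebacks and the `tree /tmp/foo` output above, the reproducer can presumably be reconstructed as follows (a sketch, using a temp dir instead of the reporter's /tmp/foo, and guarding the polars import):

```python
import os
import tempfile

# Empty hive-partitioned dir: a partition folder, but no parquet files,
# mirroring the `tree /tmp/foo` output shown in this thread.
root = os.path.join(tempfile.mkdtemp(), "foo")
os.makedirs(os.path.join(root, "bar=bat"))

try:
    import polars as pl
    # On polars 1.5.0 this raised: ComputeError: expected at least 1 path
    pl.scan_parquet(os.path.join(root, "**", "*.parquet"),
                    hive_partitioning=True).collect()
except ImportError:
    pass  # polars not installed; the directory layout above is the point
except Exception as exc:
    print(type(exc).__name__)
```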
Log output
No response
Issue description
When I upgraded to polars 1.5.0 recently, I found a scan_parquet behavior that is quite painful and slow compared to pyarrow. We use scan_parquet over a bunch of deeply nested hive-partitioned folder structures. Sometimes there may be no parquet file anywhere in that deeply nested structure. In polars 0.20.x (0.20.19 for sure, maybe even 0.20.31), this caused collect() to silently return an empty dataframe. With polars 1.5.0, I get a ComputeError exception. I'm actually reading a bunch of such dirs, concatenating the lazyframes, and doing a single collect() at the very end. As a consequence, 1.5.0 errors out with ComputeError, failing the entire collect(). pyarrow returns an empty Table (in 45 us for one such very deeply nested folder with 1300 subfolders). If I attempt to use collect_schema() as a way to catch this before adding the lazyframe to the list to concat, it takes 16 ms on the same deeply nested folder. If I do scan_pyarrow_dataset().collect_schema(), it takes 1.47 us.
Expected behavior
The scan_parquet call should return an empty dataframe for that lazyframe. If multiple lazyframes are being concatenated via pl.concat, then collect() on the concatenated lazyframe should ignore the empty dataframe, or else it will error out with a column mismatch or schema mismatch.
Installed versions