pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

scan_parquet returns ComputeError if there are no parquet files #18393

Closed ddutt closed 3 weeks ago

ddutt commented 1 month ago

Checks

Reproducible example

First create an empty hive-partitioned dir:

mkdir -p "/tmp/foo/bar=bat"

Then execute:

pl.scan_parquet('/tmp/foo/**/*.parquet', hive_partitioning=True, rechunk=False).collect()
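
Combined into a single script, the reproduction looks like this (a minimal sketch; assumes polars 1.5.0, where collect() raises the ComputeError):

```python
# Minimal reproduction: an empty hive-partitioned tree with no parquet files.
import os

import polars as pl

os.makedirs("/tmp/foo/bar=bat", exist_ok=True)

# No *.parquet file exists under /tmp/foo, so on 1.5.0 this raises
# ComputeError: expected at least 1 path
pl.scan_parquet(
    "/tmp/foo/**/*.parquet", hive_partitioning=True, rechunk=False
).collect()
```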

Log output

No response

Issue description

When I upgraded to polars 1.5.0 recently, I found a scan_parquet behavior that is quite painful and slow compared to pyarrow.

We use scan_parquet over a deeply nested hive-partitioned folder structure. Sometimes there may be no parquet file anywhere in that deeply nested structure. In polars 0.20.x (0.20.19 for sure, maybe even 0.20.31), this caused collect() to silently return an empty dataframe. With polars 1.5.0, I get a ComputeError exception. I'm actually reading a bunch of such dirs, concatenating the lazyframes, and doing a collect at the very end. As a consequence, 1.5.0 errors out with ComputeError, failing the entire collect().
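
A minimal sketch of that pattern (the directory names are hypothetical placeholders):

```python
import polars as pl

# Hypothetical top-level directories, each a deeply nested hive-partitioned tree.
dirs = ["/data/region=a", "/data/region=b", "/data/region=c"]

lazyframes = [
    pl.scan_parquet(f"{d}/**/*.parquet", hive_partitioning=True, rechunk=False)
    for d in dirs
]

# On 1.5.0, if any directory contains no parquet files, this single collect()
# fails with ComputeError instead of yielding an empty frame for that scan.
df = pl.concat(lazyframes).collect()
```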

pyarrow returns an empty Table (in 45 µs for one such very deeply nested folder with 1300 subfolders). If I attempt to use collect_schema() as a way to catch this before adding the frame to the list of lazyframes to concat, it takes 16 ms on the same deeply nested folder. If I do scan_pyarrow_dataset().collect_schema(), it takes 1.47 µs.

Expected behavior

The scan_parquet call should return an empty dataframe for that lazyframe. If multiple lazyframes are being concatenated via pl.concat, then collecting the concatenated lazyframes should ignore the empty dataframe; otherwise you'll error out with a column mismatch or schema mismatch.
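
In the meantime, one user-side approximation of that behavior is to pre-check each directory with the standard-library glob module and skip the empty ones (a sketch; the helper name and the pre-check are my own, not polars API):

```python
import glob

import polars as pl

def scan_if_nonempty(d: str) -> pl.LazyFrame | None:
    """Hypothetical helper: scan a directory only if the glob matches any file."""
    pattern = f"{d}/**/*.parquet"
    if glob.glob(pattern, recursive=True):
        return pl.scan_parquet(pattern, hive_partitioning=True, rechunk=False)
    return None  # no parquet files: treat as empty and leave it out

dirs = ["/data/region=a", "/data/region=b"]  # hypothetical placeholders
lazyframes = [lf for d in dirs if (lf := scan_if_nonempty(d)) is not None]
df = pl.concat(lazyframes).collect() if lazyframes else pl.DataFrame()
```

The catch is that the pre-check re-walks the directory tree, which is exactly the traversal cost measured above.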

Installed versions

```
--------Version info---------
Polars:              1.4.1
Index type:          UInt32
Platform:            Linux-6.5.0-45-generic-x86_64-with-glibc2.35
Python:              3.11.6 (main, Oct 13 2023, 14:12:02) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:
nest_asyncio:
numpy:               1.26.3
openpyxl:
pandas:              2.2.2
pyarrow:             16.1.0
pydantic:            2.6.0
pyiceberg:
sqlalchemy:          2.0.25
torch:
xlsx2csv:
xlsxwriter:
```
coastalwhite commented 1 month ago

@nameexhaustion Can you take a look at this?

nameexhaustion commented 1 month ago

Thanks for the report! We'd like to know some more details to help us debug. I've tried the example you've given on both the latest version and 0.20.19, and on both versions it gave an error. Would you be able to provide a reproducible example that runs successfully on the older version you upgraded from but fails on the latest versions? It would also help to know the exact error message (e.g. FileNotFoundError: No such file or directory (os error 2): non_existent.csv) that you saw.

ddutt commented 1 month ago

I get the following error:

In [1]: pl.scan_parquet('/tmp/foo/**/*.parquet', hive_partitioning=True, rechunk=False).collect()
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[1], line 1
----> 1 pl.scan_parquet('/tmp/foo/**/*.parquet', hive_partitioning=True, rechunk=False).collect()

File ~/work/stardust/enterprise/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:2027, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2025 # Only for testing purposes
   2026 callback = _kwargs.get("post_opt_callback", callback)
-> 2027 return wrap_df(ldf.collect(callback))

ComputeError: expected at least 1 path

ddutt commented 1 month ago

I'm surprised it worked in 0.20.x. I only noticed this when our tests started failing because we had accidentally created the conditions that triggered it. So it's possible it failed in 0.20.19 as well.

ddutt commented 1 month ago

In 0.20.19, I get this error:

In [1]: pl.scan_parquet('/tmp/foo/**/*.parquet', hive_partitioning=True, rechunk=False).collect()
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[1], line 1
----> 1 pl.scan_parquet('/tmp/foo/**/*.parquet', hive_partitioning=True, rechunk=False).collect()

File ~/work/dsengines/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py:1967, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1964 # Only for testing purposes atm.
   1965 callback = _kwargs.get("post_opt_callback")
-> 1967 return wrap_df(ldf.collect(callback))

FileNotFoundError: No such file or directory (os error 2): /tmp/foo/**/*.parquet

I see now why our code worked with this: it explicitly caught this error.
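
Roughly, the guard looked like this (a sketch; on 0.20.19 the failure surfaces as the FileNotFoundError above, while 1.5.0 raises ComputeError instead, so both are caught here):

```python
import polars as pl

try:
    df = pl.scan_parquet(
        "/tmp/foo/**/*.parquet", hive_partitioning=True, rechunk=False
    ).collect()
except (FileNotFoundError, pl.exceptions.ComputeError):
    # No parquet files matched the glob: fall back to an empty dataframe.
    df = pl.DataFrame()
```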

ritchie46 commented 1 month ago

On the performance:

Pyarrow scan_dataset doesn't do globbing.

Whereas in polars scan_parquet, you pass a recursive glob, so we must traverse all subdirectories. I'd expect this to be much more expensive with many subfolders.

Try scan_parquet with hive partitioning where you only pass the directory.
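
In code, that suggestion looks like this (a sketch of the call shape only):

```python
import polars as pl

# Pass the directory itself; polars discovers the files and hive partitions,
# rather than expanding a recursive glob over every subdirectory.
lf = pl.scan_parquet("/tmp/foo/", hive_partitioning=True, rechunk=False)
```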

ddutt commented 1 month ago

> On the performance:
>
> Pyarrow scan_dataset doesn't do globbing.
>
> Whereas in polars scan_parquet, you pass a recursive glob, so we must traverse all subdirectories. I'd expect this to be much more expensive with many subfolders.
>
> Try scan_parquet with hive partitioning where you only pass the directory.

I didn't know I could do that now. Will try. How does scan_pyarrow_dataset know that there's no data if it doesn't recurse?

ddutt commented 1 month ago

scan_parquet without the glob on a completely empty folder hierarchy suffers a PanicException:

In [2]: pl.scan_parquet('/tmp/foo/', rechunk=False, hive_partitioning=True).collect_schema()
thread '<unnamed>' panicked at crates/polars-io/src/path_utils/mod.rs:342:28:
index out of bounds: the len is 0 but the index is 0
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 pl.scan_parquet('/tmp/foo/', rechunk=False, hive_partitioning=True).collect_schema()

File ~/work/stardust/enterprise/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:2231, in LazyFrame.collect_schema(self)
   2201 def collect_schema(self) -> Schema:
   2202     """
   2203     Resolve the schema of this LazyFrame.
   2204 
   (...)
   2229     3
   2230     """
-> 2231     return Schema(self._ldf.collect_schema())

PanicException: index out of bounds: the len is 0 but the index is 0


The folder hierarchy is the same as in the example provided:

➜ tree /tmp/foo
/tmp/foo
└── bar=bat

1 directory, 0 files

And:

In [1]: from pyarrow.dataset import dataset

In [2]: ds = dataset('/tmp/foo/')

In [3]: pl.scan_pyarrow_dataset(ds).collect_schema()
Out[3]: Schema()

nameexhaustion commented 1 month ago

> scan_parquet without the glob on a completely empty folder hierarchy suffers a PanicException:

I will fix this in https://github.com/pola-rs/polars/pull/18620

nameexhaustion commented 3 weeks ago

https://github.com/pola-rs/polars/pull/18620