pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars cannot read Parquet files written by Spark #14377

Open sugibuchi opened 8 months ago

sugibuchi commented 8 months ago


Reproducible example

For a POC, we install Polars and some other libraries that support the Parquet format.

pip install -q polars pyspark pandas pyarrow

Generate a sample Parquet dataset with Spark.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session.
spark = SparkSession.builder.getOrCreate()

# Write 10,000 rows as 10 partitioned Parquet files under test/.
spark.createDataFrame([{"col": v} for v in range(0, 10000)]) \
    .repartition(10) \
    .write.parquet("test")

This code generates 10 Parquet files plus one _SUCCESS file under test/.

ls -1 test/

part-00000-9943d681-7a45-4de3-a699-63947ae0f189-c000.snappy.parquet
part-00001-9943d681-7a45-4de3-a699-63947ae0f189-c000.snappy.parquet
...
part-00009-9943d681-7a45-4de3-a699-63947ae0f189-c000.snappy.parquet
_SUCCESS

POC:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet
import polars as pl

# With Spark (`spark` is the session created above)
spark.read.parquet("test").count()

# With Pandas + PyArrow
pd.read_parquet("test").shape

# PyArrow
pa.parquet.read_table("test").shape

# Polars (1)
pl.read_parquet("test").shape
# Polars (2)
pl.read_parquet("test/*").shape
# Polars (3)
pl.read_parquet("test/*.parquet").shape

Log output

# With Spark
10000

# With Pandas + PyArrow
(10000, 1)

# PyArrow
(10000, 1)

# Polars (1)
IsADirectoryError: expected a file path; 'test' is a directory

# Polars (2)
ComputeError: parquet: File out of specification: A parquet file must contain a header and footer with at least 12 bytes

# Polars (3)
(10000, 1)

Issue description

The write.parquet(...) method of DataFrameWriter in PySpark (and native Spark) writes one dataframe as multiple, partitioned Parquet files in a specified output directory. Additionally, Spark creates an empty _SUCCESS file once all Parquet files have been written successfully.

Such a Parquet file "directory", holding many Parquet files of one table plus some metadata files (_SUCCESS, etc.), is a very common structure found in various data lakes.

The current version of Polars can read Parquet file directories created by Spark, but not as intuitively as other existing libraries.

Other libraries can read a Parquet file directory just by specifying its location ("test" in the example), as seen above for Spark, Pandas+PyArrow, and PyArrow.

On the other hand, Polars requires a precise glob pattern ("test/*.parquet") to read the same directory, as seen in the Polars (3) example.
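
The lazy API behaves the same way. A minimal sketch (assuming the test/ directory generated above) showing that scan_parquet also needs the precise glob:

import polars as pl

# scan_parquet accepts the same glob pattern as read_parquet, so the
# precise "*.parquet" pattern is required for the lazy API as well.
lf = pl.scan_parquet("test/*.parquet")
print(lf.collect().shape)  # (10000, 1)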

We can see two different problems:

1. read_parquet does not accept a directory path at all (Polars (1)).
2. A naive glob such as "test/*" fails, because Polars tries to parse the non-Parquet _SUCCESS marker file (Polars (2)).

Related issue: #9396

Expected behavior

read_parquet and scan_parquet should work with paths of Parquet file directories created by Spark.
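
Concretely, the expectation is that the directory path alone would be enough. A sketch of the desired behavior (this is what the issue asks for, not what Polars 0.20.7 does):

import polars as pl

# Desired behavior, not current: accept a directory path and skip
# non-Parquet metadata files such as _SUCCESS.
pl.read_parquet("test").shape            # desired result: (10000, 1)
pl.scan_parquet("test").collect().shape  # desired result: (10000, 1)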

Installed versions

```
--------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             Linux-5.15.0-1053-azure-x86_64-with-glibc2.31
Python:               3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fsspec:               2022.11.0
gevent:
hvplot:
matplotlib:           3.8.2
numpy:                1.26.3
openpyxl:             3.1.2
pandas:               1.5.3
pyarrow:              14.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:           2.0.25
xlsx2csv:
xlsxwriter:
```
deanm0000 commented 8 months ago

While I sympathize with your request, it isn't a bug that Polars uses a syntax dissimilar to other libraries'. As such, I marked this as an enhancement.

In the meantime, try this to skip the _SUCCESS files:

https://stackoverflow.com/a/36295481/1818713
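
One way to apply the same idea directly in Python — a minimal sketch, assuming the local test/ directory from the example above:

import polars as pl
from pathlib import Path

# Collect only the real Parquet files, skipping _SUCCESS and any other
# metadata files Spark leaves in the output directory.
files = sorted(p for p in Path("test").iterdir() if p.suffix == ".parquet")

# Read each file and concatenate into one DataFrame.
df = pl.concat([pl.read_parquet(f) for f in files])
print(df.shape)  # (10000, 1)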

sugibuchi commented 8 months ago

OK. I agree that this ticket is more of an enhancement.

roman-janik commented 7 months ago

I strongly disagree that this is just an enhancement. The error for the Polars (1) option, where a directory path is used, contradicts the official docs:

source Path to a file, or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via builtin open function) or BytesIO). If the path is a directory, files in that directory will all be read.

MatthiasRoels commented 5 months ago

It would indeed be very nice if Polars, just like PyArrow and pandas + PyArrow, could natively read Parquet files generated by Spark.
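
Until then, PyArrow's dataset API can serve as a bridge; by default it ignores files whose names start with "_" or ".", so Spark's _SUCCESS marker is skipped. A minimal sketch, assuming the same local test/ directory:

import pyarrow.dataset as ds
import polars as pl

# ds.dataset ignores files prefixed with "_" or "." by default
# (the ignore_prefixes parameter), so _SUCCESS is skipped automatically.
dataset = ds.dataset("test", format="parquet")
df = pl.from_arrow(dataset.to_table())
print(df.shape)  # (10000, 1)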

afaehnrich commented 3 months ago

I had a similar problem with the _SUCCESS files when I tried to read Spark-generated Parquet files using the AWS SDK for pandas. I found out that their read_parquet function has two parameters, path_suffix and path_ignore_suffix, that can be used to filter files. Adding something similar to pl.read_parquet would be a great improvement for Polars!
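
For reference, a sketch of how that filtering looks in the AWS SDK for pandas (awswrangler); the bucket and prefix here are placeholders:

import awswrangler as wr

# path_suffix keeps only objects ending in ".parquet", so Spark's
# _SUCCESS marker is never read; path_ignore_suffix works the other
# way around, excluding files by suffix.
df = wr.s3.read_parquet(
    path="s3://my-bucket/test/",  # placeholder bucket/prefix
    path_suffix=".parquet",
)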