sugibuchi opened this issue 8 months ago
While I sympathize with your request, it isn't a bug that Polars uses a syntax dissimilar to other libraries. As such, I marked this as an enhancement.
In the meantime try this to skip success files
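The snippet this comment referred to did not survive extraction; presumably it relied on passing a glob that matches only `.parquet` files, such as `pl.read_parquet("test/*.parquet")`. A stdlib-only sketch of why such a glob skips the marker file:

```python
from pathlib import Path
import tempfile

# Simulate the layout Spark leaves behind: part files plus an empty _SUCCESS.
tmp = Path(tempfile.mkdtemp())
for i in range(3):
    (tmp / f"part-0000{i}-c000.snappy.parquet").touch()
(tmp / "_SUCCESS").touch()

# A "*.parquet" glob -- the pattern you would hand to pl.read_parquet or
# pl.scan_parquet -- matches only the data files and never the marker.
matched = sorted(p.name for p in tmp.glob("*.parquet"))
print(matched)
```

The file names above follow Spark's usual `part-...-c000.snappy.parquet` convention, but any `.parquet` suffix behaves the same way.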
OK. I agree that this ticket is more of an enhancement.
I strongly disagree that this is just an enhancement. The error in option Polars (1), where a directory path is used, contradicts the official docs:

> **source** — Path to a file, or a file-like object (by file-like object, we refer to objects that have a `read()` method, such as a file handler (e.g. via builtin `open` function) or `BytesIO`). If the path is a directory, files in that directory will all be read.
It would indeed be very nice if Polars, just like PyArrow and pandas + PyArrow, could natively read Parquet files generated by Spark.
I had a similar problem with the `_SUCCESS` files when I tried to read Spark-generated Parquet files using AWS SDK for pandas. I found out that their `read_parquet` function has two parameters, `path_suffix` and `path_ignore_suffix`, that can be used to filter files. Adding something similar to `pl.read_parquet` would be a great improvement for Polars!
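For reference, the filtering those two parameters perform can be sketched in a few lines. The semantics are assumed from AWS SDK for pandas (keep paths ending in `path_suffix`, drop paths ending in `path_ignore_suffix`); `filter_paths` is a hypothetical helper, not part of any library:

```python
def filter_paths(paths, path_suffix=None, path_ignore_suffix=None):
    """Hypothetical helper mimicking awswrangler-style suffix filtering."""
    kept = list(paths)
    if path_suffix is not None:
        # Keep only paths ending with the requested suffix.
        kept = [p for p in kept if p.endswith(path_suffix)]
    if path_ignore_suffix is not None:
        # Drop paths ending with the ignored suffix.
        kept = [p for p in kept if not p.endswith(path_ignore_suffix)]
    return kept

paths = ["test/part-00000.snappy.parquet", "test/_SUCCESS"]
print(filter_paths(paths, path_suffix=".parquet"))
```

A `path_ignore_suffix`-style option on `pl.read_parquet` would amount to running this kind of filter over the expanded file list before reading.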
Checks
Reproducible example
For a POC, we install Polars and some other libraries supporting the Parquet format.

Generate a sample Parquet file directory with Spark. This code generates 10 Parquet files plus one `_SUCCESS` file under `test/`.

POC:
Log output
Issue description
The `write.parquet(...)` method of Spark's `DataFrameWriter` (in PySpark and native Spark) outputs one dataframe as multiple, partitioned Parquet files in a specified output directory. Additionally, Spark creates an empty `_SUCCESS` file when all Parquet files have been written successfully.

Such a Parquet file "directory", holding many Parquet files of one table plus some metadata files (`_SUCCESS`, etc.), is a very common structure found in various data lakes.

The current version of Polars can read Parquet file directories created by Spark. However, it is not intuitive compared to other existing libraries. Other libraries can read a Parquet file directory just by specifying its location (`"test"` in the example), as seen in the examples for Spark, Pandas+PyArrow, and PyArrow. On the other hand, Polars requires a precise glob pattern (`"test/*.parquet"`) to read the same directory, as seen in the example Polars (3).

We can see two different problems:

1. When `read_parquet` and `scan_parquet` are invoked with a directory path, Polars should scan the files in that directory.
2. When `read_parquet` and `scan_parquet` encounter a metadata file like `_SUCCESS` during a file scan, Polars should skip it.

Related issue: #9396
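The two desired behaviors can be sketched as a small path-resolution step that would run before reading (all names here are hypothetical, not a proposed Polars API):

```python
from pathlib import Path

# Marker/metadata files commonly emitted by Spark/Hadoop jobs (assumed list).
_METADATA_NAMES = {"_SUCCESS", "_metadata", "_common_metadata"}

def spark_parquet_files(path):
    """Hypothetical scan logic: a directory resolves to its Parquet data
    files with metadata markers skipped; a plain file resolves to itself."""
    p = Path(path)
    if p.is_dir():
        return sorted(
            f for f in p.iterdir()
            if f.name not in _METADATA_NAMES and not f.name.endswith(".crc")
        )
    return [p]

# Demo on a simulated Spark output directory.
import tempfile
tmp = Path(tempfile.mkdtemp())
for name in ("part-00000.snappy.parquet", "part-00001.snappy.parquet", "_SUCCESS"):
    (tmp / name).touch()
data_files = spark_parquet_files(tmp)
print([f.name for f in data_files])
```

Problem (1) is the directory branch; problem (2) is the marker filter inside it. The resulting file list is what `read_parquet`/`scan_parquet` would actually open.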
Expected behavior
`read_parquet` and `scan_parquet` should work with paths of Parquet file directories created by Spark.

Installed versions