pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

write_parquet(..., pyarrow_options={'partition_cols': [..., ]}) munges partition column #17619

Open wasade opened 1 month ago

wasade commented 1 month ago

Reproducible example

In [93]: pl.__version__
Out[93]: '1.1.0'

In [94]: df = pl.DataFrame([[0, 1, 2, 'a'], [3,4,5, 'b'], [6,7,8, 'c']], orient='row')

In [95]: df
Out[95]: 
shape: (3, 4)
┌──────────┬──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 ┆ column_3 │
│ ---      ┆ ---      ┆ ---      ┆ ---      │
│ i64      ┆ i64      ┆ i64      ┆ str      │
╞══════════╪══════════╪══════════╪══════════╡
│ 0        ┆ 1        ┆ 2        ┆ a        │
│ 3        ┆ 4        ┆ 5        ┆ b        │
│ 6        ┆ 7        ┆ 8        ┆ c        │
└──────────┴──────────┴──────────┴──────────┘

In [96]: df.write_parquet('../ex', use_pyarrow=True, pyarrow_options={'partition_cols': ['column_3', ]})

In [97]: ls ../ex
total 192K
drwxr-xr-x 2 dtmcdonald 4.0K Jul 13 15:08 column_3=[?  "a"?]
drwxr-xr-x 2 dtmcdonald 4.0K Jul 13 15:08 column_3=[?  "b"?]
drwxr-xr-x 2 dtmcdonald 4.0K Jul 13 15:08 column_3=[?  "c"?]

In [98]: pl.scan_parquet('../ex').collect()
Out[98]: 
shape: (3, 4)
┌──────────┬──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 ┆ column_3 │
│ ---      ┆ ---      ┆ ---      ┆ ---      │
│ i64      ┆ i64      ┆ i64      ┆ str      │
╞══════════╪══════════╪══════════╪══════════╡
│ 0        ┆ 1        ┆ 2        ┆ [        │
│          ┆          ┆          ┆   "a"    │
│          ┆          ┆          ┆ ]        │
│ 3        ┆ 4        ┆ 5        ┆ [        │
│          ┆          ┆          ┆   "b"    │
│          ┆          ┆          ┆ ]        │
│ 6        ┆ 7        ┆ 8        ┆ [        │
│          ┆          ┆          ┆   "c"    │
│          ┆          ┆          ┆ ]        │
└──────────┴──────────┴──────────┴──────────┘

Log output

No output generated

Issue description

Use of pyarrow_options={'partition_cols': [...]} appears to mangle the column being partitioned: the partition directory names are garbled on write, and the values come back mangled on read. As far as I can tell, the example is consistent with the documentation.

Not sure if relevant, but it appears that the partitioned column here is regarded as a large_string even though each value is a single character.

In [9]: pa = df.to_arrow()

In [10]: pa
Out[10]: 
pyarrow.Table
column_0: int64
column_1: int64
column_2: int64
column_3: large_string
----
column_0: [[0,3,6]]
column_1: [[1,4,7]]
column_2: [[2,5,8]]
column_3: [["a","b","c"]]
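Since the mangling seems tied to large_string, one possible workaround (a sketch, untested against this exact setup) is to cast the partition column to plain string on the Arrow table and write with pyarrow directly; df here is the frame from In [94] above:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Cast the partition column from large_string to string before writing.
# Assumption: the mangled directory names only occur for large_string.
tbl = df.to_arrow()
idx = tbl.schema.get_field_index("column_3")
tbl = tbl.set_column(idx, "column_3", tbl["column_3"].cast(pa.string()))
pq.write_to_dataset(tbl, root_path="../ex", partition_cols=["column_3"])
```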

Thank you for your time and effort. Please let me know if additional information is helpful, or if there is a suggested direction which I could follow to issue a PR (if one is warranted).

Expected behavior

I would expect the output directory names to correspond to the values on write, and I would expect the column values to be unchanged on read.

Installed versions

```
In [8]: pl.show_versions()
--------Version info---------
Polars:              1.1.0
Index type:          UInt32
Platform:            Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.17
Python:              3.11.3 (main, Apr 19 2023, 23:54:32) [GCC 11.2.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:           0.18.0
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:          3.8.3
nest_asyncio:
numpy:               1.26.4
openpyxl:
pandas:              2.2.1
pyarrow:             10.0.1
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```
cmdlineluser commented 1 month ago

Just a mention that .write_parquet_partitioned() was added in 1.1.0 which may be a possible workaround.
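For reference, a minimal usage sketch of that method; the exact signature is an assumption on my part (I'm guessing it takes a target directory plus a partition_by list of column names), so check the 1.1.0 API reference:

```python
# Assumed signature for the 1.1.0 method; verify against the polars docs.
df.write_parquet_partitioned("../ex_partitioned", partition_by=["column_3"])
```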

wasade commented 1 month ago

Thank you, @cmdlineluser! I'll take it for a spin at the next chance.

Julian-J-S commented 1 month ago

> Just a mention that .write_parquet_partitioned() was added in 1.1.0 which may be a possible workaround.

Not sure if this gets removed again but partitioning was added to the original functions in #17586
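If that lands, the call would presumably move onto write_parquet itself, along these lines (a sketch assuming a partition_by parameter; see #17586 for the actual API):

```python
# Sketch assuming write_parquet gains a partition_by parameter (per #17586).
df.write_parquet("../ex_partitioned", partition_by=["column_3"])
```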

wasade commented 1 month ago

On the surface, this workaround seems to work, although I don't believe it supports append right now, which would be a nice-to-have.

cmdlineluser commented 1 month ago

The problem does seem to be large_string, as you mentioned.

If we use pyarrow directly:

import pyarrow as pa
import pyarrow.parquet as pq

# Build a table whose partition column is explicitly large_string.
tbl = pa.table(
    {
        "year": [2020, 2022, 2021],
        "animal": ["a", "b", "c"],
    },
    schema=pa.schema([("year", pa.int16()), ("animal", pa.large_string())]),
)

# Partitioning on the large_string column reproduces the mangled names.
pq.write_to_dataset(tbl, root_path="17619", partition_cols=["animal"])
!ls 17619
animal=[?  "a"?] animal=[?  "b"?] animal=[?  "c"?]

Someone else seems to have had the same problem in https://github.com/pola-rs/polars/issues/15181 (although they say they get ... instead, which I cannot replicate).

It seems some others have asked why large_string is used (presumably because polars' native string type maps to Arrow's 64-bit-offset large_string in to_arrow()).

I can't seem to find any pyarrow docs/issues on this, but it seems partition_cols must not be of type large_string?
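If so, a defensive pre-write step could downcast any large_string partition columns first. normalize_partition_cols below is a hypothetical helper of my own, not a pyarrow API:

```python
import pyarrow as pa

def normalize_partition_cols(tbl: pa.Table, partition_cols: list[str]) -> pa.Table:
    """Downcast any large_string partition columns to string before writing."""
    for name in partition_cols:
        idx = tbl.schema.get_field_index(name)
        if tbl.schema.field(idx).type == pa.large_string():
            tbl = tbl.set_column(idx, name, tbl[name].cast(pa.string()))
    return tbl

# Usage with the repro table above (hypothetical helper, not a pyarrow API):
# pq.write_to_dataset(
#     normalize_partition_cols(tbl, ["animal"]),
#     root_path="17619",
#     partition_cols=["animal"],
# )
```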