Polars Datatype Catalog Entry Cannot Partition on Saving Parquet

Description

I would like to save partitioned Parquet datasets from Polars using write_parquet, which currently relies on PyArrow for partitioning.
- Following documentation: https://docs.pola.rs/api/python/version/0.18/reference/api/polars.DataFrame.write_parquet.html
When saving through Kedro, both the default writer (Rust-based implementation) and the PyArrow writer raise errors (see below).
I have encountered this both when writing locally and when writing to S3.
Explicitly passing a filesystem in the catalog (see code below) does not help.
I believe this is because the dataset's save method hands write_parquet a BytesIO object, rather than writing the DataFrame directly to the target path:
- https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-4.1.0/_modules/kedro_datasets/polars/lazy_polars_dataset.html#LazyPolarsDataset
- https://docs.kedro.org/projects/kedro-datasets/en/latest/_modules/kedro_datasets/polars/eager_polars_dataset.html
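To illustrate the suspected cause, here is a minimal sketch of how a save routine could skip the in-memory buffer when partitioning is requested. The helper names `is_partitioned_write` and `save_parquet` are hypothetical and are not part of kedro-datasets or Polars; the sketch also assumes a local path, so a remote target such as S3 would need extra handling.

```python
from io import BytesIO
from typing import Any


def is_partitioned_write(save_args: dict[str, Any]) -> bool:
    """Return True when save_args request a partitioned (multi-file) write."""
    # Native Polars partitioning
    if save_args.get("partition_by"):
        return True
    # PyArrow partitioning, only relevant when use_pyarrow is enabled
    pyarrow_options = save_args.get("pyarrow_options") or {}
    return bool(save_args.get("use_pyarrow") and pyarrow_options.get("partition_cols"))


def save_parquet(df: Any, filepath: str, save_args: dict[str, Any]) -> str:
    """Write df to filepath and return the strategy used.

    df is any object exposing write_parquet(target, **kwargs),
    for example a polars.DataFrame.
    """
    if is_partitioned_write(save_args):
        # A partitioned dataset is a directory tree, so pass the path
        # straight to write_parquet instead of an in-memory buffer.
        df.write_parquet(filepath, **save_args)
        return "direct"
    # Single-file case: write to a buffer, then copy the bytes to the target.
    buf = BytesIO()
    df.write_parquet(buf, **save_args)
    with open(filepath, "wb") as fh:
        fh.write(buf.getvalue())
    return "buffered"
```

The buffered branch mirrors the behaviour I believe the current datasets have; only the partitioned branch is different.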

Context

I am trying to partition a large DataFrame with Polars on a single column.

Steps to Reproduce

Sample dataset that runs locally with no issues:

import polars as pl

df = pl.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
path = "tmp/test.parquet"

# partitioned write through PyArrow
df.write_parquet(
    path,
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["B"]},
)

# this also runs with no issues (native Polars partitioning)
df.write_parquet(
    path,
    partition_by=["B"],
)

Sample Kedro code attempting the same write:

# pipelines.py
import polars as pl
from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline


def my_func() -> pl.DataFrame:
    return pl.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})


def create_pipeline() -> Pipeline:
    return pipeline(
        [
            node(
                func=my_func,
                inputs=None,  # the node takes no inputs
                outputs="my_entry",
                name="partition_polars",
            )
        ]
    )

# catalog.yml
# using Rust
my_entry:
  # also tried with polars.LazyPolarsDataset
  type: polars.EagerPolarsDataset 
  filepath: /tmp/test.parquet
  file_format: parquet
  save_args:
    partition_by: 
      - B

# catalog.yml
# using pyarrow (C++)
my_entry:
  type: polars.EagerPolarsDataset
  filepath: /tmp/test.parquet
  file_format: parquet
  save_args:
    use_pyarrow: True
    pyarrow_options:
      partition_cols:
        - B
  fs_args:
    filesystem: pyarrow._fs.FileSystem

Expected Result

A new partitioned Parquet dataset should be created locally or in S3.
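For a local run, one way to check the expected result is to list the partition directories under the output path. This is a small sketch that assumes hive-style `column=value` directory names; `list_partitions` is a hypothetical helper, not a Kedro or Polars function.

```python
from pathlib import Path


def list_partitions(root: str) -> list[str]:
    """Return the sorted names of hive-style partition directories under root."""
    root_path = Path(root)
    if not root_path.is_dir():
        # A single-file write leaves a plain file (or nothing) here, not a directory.
        return []
    return sorted(
        p.name for p in root_path.iterdir() if p.is_dir() and "=" in p.name
    )
```

For the sample frame partitioned on B, I would expect `list_partitions("/tmp/test.parquet")` to return one entry per distinct value of B, along the lines of `B=4`, `B=5`, `B=6`.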

Actual Result

From Rust implementation:

DatasetError: Failed while saving data to data set
EagerPolarsDataset(file_format=parquet, filepath=/tmp/test.parquet,
load_args={}, protocol=file, save_args={'partition_by': ['dt1y']}).
'BytesIO' object cannot be converted to 'PyString'

From Pyarrow:

DatasetError: Failed while saving data to data set
 LazyPolarsDataset(filepath=/tmp/test.parquet, load_args={}, protocol=file, 
save_args={'pyarrow_options': {'compression': zstd, 'partition_cols': ['dt1y'],
'write_statistics': True}, 'use_pyarrow': True}).
Argument 'filesystem' has incorrect type (expected pyarrow._fs.FileSystem, got 
NoneType)

Your Environment

Kedro version used (pip show kedro or kedro -V): 0.19.3
Polars: 1.9.0 and 1.6.0
Python version used (python -V): 3.11
Operating system and version: MacOS M1 using Docker Compose + Docker Desktop

kedro-org / kedro

Polars Datatype Catalog Entry Cannot Partition on Saving Parquet #4242
