kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Polars Datatype Catalog Entry Cannot Partition on Saving Parquet #4242

Open alexdavis24 opened 3 hours ago

alexdavis24 commented 3 hours ago

Description

Context

Steps to Reproduce

  1. Sample dataset that runs locally with no issues:
    import polars as pl

    df = pl.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
    path = "tmp/test.parquet"
    # partitioned write through pyarrow
    df.write_parquet(
        path,
        use_pyarrow=True,
        pyarrow_options={"partition_cols": ["B"]},
    )
    # this also runs with no issues
    df.write_parquet(
        path,
        partition_by=["B"],
    )
  2. Sample code following the same implementation:
    # pipelines.py
    import polars as pl
    from kedro.pipeline import Pipeline, node
    from kedro.pipeline.modular_pipeline import pipeline

    def my_func():
        return pl.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

    def create_pipeline() -> Pipeline:
        return pipeline(
            [
                node(
                    func=my_func,
                    inputs=None,
                    outputs="my_entry",
                    name="partition_polars",
                )
            ]
        )
    # catalog.yml
    # using Rust
    my_entry:
      # also tried with polars.LazyPolarsDataset
      type: polars.EagerPolarsDataset
      filepath: /tmp/test.parquet
      file_format: parquet
      save_args:
        partition_by:
          - B

    # catalog.yml
    # using pyarrow (C++)
    my_entry:
      type: polars.EagerPolarsDataset
      filepath: /tmp/test.parquet
      file_format: parquet
      save_args:
        use_pyarrow: True
        pyarrow_options:
          partition_cols:
            - B
      fs_args:
        filesystem: pyarrow._fs.FileSystem

Expected Result

Actual Result

From Rust implementation:

DatasetError: Failed while saving data to data set
EagerPolarsDataset(file_format=parquet, filepath=/tmp/test.parquet,
load_args={}, protocol=file, save_args={'partition_by': ['dt1y']}).
'BytesIO' object cannot be converted to 'PyString'

From Pyarrow:

DatasetError: Failed while saving data to data set
 LazyPolarsDataset(filepath=/tmp/test.parquet, load_args={}, protocol=file, 
save_args={'pyarrow_options': {'compression': zstd, 'partition_cols': ['dt1y'],
'write_statistics': True}, 'use_pyarrow': True}).
Argument 'filesystem' has incorrect type (expected pyarrow._fs.FileSystem, got 
NoneType)
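
Both errors are consistent with the writer receiving something other than a filesystem path: the first message names a BytesIO object where a string was expected, and the second reports a missing filesystem. A partitioned write produces a directory tree with one file per partition value, which a single in-memory buffer cannot represent. The sketch below illustrates that hive-style layout using only the standard library, with CSV standing in for Parquet. The helper `write_partitioned` and the `part-0.csv` file name are hypothetical and are not part of Polars, pyarrow, or Kedro.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, root, partition_col):
    """Write rows as hive-style partitions: root/<col>=<value>/part-0.csv.

    Hypothetical helper for illustration only; CSV stands in for Parquet.
    """
    # Group rows by the value of the partition column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)

    written = []
    for value, group in groups.items():
        part_dir = Path(root) / f"{partition_col}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition column is encoded in the directory name, not the file.
        fields = [key for key in group[0] if key != partition_col]
        out = part_dir / "part-0.csv"
        with out.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)
        written.append(out.relative_to(root).as_posix())
    return sorted(written)


# Same sample data as in the reproduction steps.
rows = [{"A": 1, "B": 4}, {"A": 2, "B": 5}, {"A": 3, "B": 6}]
with tempfile.TemporaryDirectory() as tmp:
    files = write_partitioned(rows, tmp, "B")
print(files)
```

The output is a list of several relative paths, one per distinct value of `B`, which is why the target of a partitioned write has to be a directory on a filesystem and not a single file-like object.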

Your Environment

datajoely commented 3 hours ago

Just want to say thanks for such a clear write up and investigation 💪