pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Partitioning for parquet output type #9372

Closed artsiom-tsaryonau closed 1 month ago

artsiom-tsaryonau commented 1 year ago

Problem description

Hi!

I have been using Pandas until recently, but an issue with storing the date 9999-12-31 in a Parquet file forced me to switch to something else. While searching for an alternative, I decided to go with pyspark, since I can take advantage of a Databricks instance. Today, however, I came across this library, which seems to have everything I need except partitioning (I am also not yet certain about the timestamp type). Pandas supports partitioning out of the box when creating a Parquet file from a dataframe with DataFrame.to_parquet, but this library does not seem to offer built-in partitioning for my scenario.

It would be nice to have similar functionality. Are there plans to add partitioning support for Parquet output?

Meanwhile, I was examining the source code and, for testing purposes, made a basic change in my fork using the same approach as Pandas' parquet.py.

I have written a small script to test it:

import polars as pl

def test_partitioning():
    df = pl.DataFrame(
        data={
            'column_1': ["a", "b", "a", "d"],
            'column_2': ["1", "1", "2", "1"]
        }
    )

    df.write_parquet('folder/', use_pyarrow=True, pyarrow_options={'partition_cols': ['column_1']})

and it seems to be working fine (at first glance, at least):

folder\
    column_1=a\
        ac881002452f4481b405a141faaa76b8-0.parquet
    column_1=b\
        ac881002452f4481b405a141faaa76b8-0.parquet
    column_1=d\
        ac881002452f4481b405a141faaa76b8-0.parquet
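
For readers unfamiliar with this layout, here is a rough, standard-library-only sketch of how rows map to the `column_1=<value>` directories shown above. The helper name `hive_partition_dirs` is hypothetical and is not part of polars, Pandas, or pyarrow. It also assumes the partition column is encoded in the directory name rather than stored in each file, which is how pyarrow's partitioned writes behave as far as I know.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def hive_partition_dirs(rows, partition_col, root="folder"):
    """Hypothetical helper: group rows by partition_col into key=value directories."""
    groups = defaultdict(list)
    for row in rows:
        # Assumption: the partition value goes into the directory name,
        # so it is dropped from the row that would be written to the file.
        remaining = {k: v for k, v in row.items() if k != partition_col}
        directory = PurePosixPath(root) / f"{partition_col}={row[partition_col]}"
        groups[str(directory)].append(remaining)
    return dict(groups)


# Same data as the test script in this issue.
rows = [
    {"column_1": "a", "column_2": "1"},
    {"column_1": "b", "column_2": "1"},
    {"column_1": "a", "column_2": "2"},
    {"column_1": "d", "column_2": "1"},
]
layout = hive_partition_dirs(rows, "column_1")
for directory in sorted(layout):
    print(directory, layout[directory])
```

Each key of `layout` corresponds to one directory in the tree, and the list under it holds the rows that would end up in that directory's Parquet file.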
artsiom-tsaryonau commented 1 year ago

Now that I think about it, another way forward is probably to use the "write_delta" function and then remove the "_delta_log" folder.

ion-elgreco commented 10 months ago

You can write partitioned Parquet by setting use_pyarrow=True in write_parquet and passing partition_cols through pyarrow_options.

stinodego commented 1 month ago

Closing as a duplicate of https://github.com/pola-rs/polars/issues/17163