Closed artsiom-tsaryonau closed 1 month ago
Now that I think about it, another possible way forward is to use the `write_delta` function and then remove the `_delta_log` folder.
You can write partitioned Parquet by using pyarrow in `write_parquet` and passing the partition columns via `pyarrow_options`.
Closing as a duplicate of https://github.com/pola-rs/polars/issues/17163
Problem description
Hi!
I have been using Pandas until recently, but due to an issue with storing the date 9999-12-31 in a Parquet file, I had to switch to something else. I was searching for an alternative and decided to go with pyspark, since I can take advantage of a Databricks instance. However, today I came across this library, which seems to have everything I need except partitioning (and I am also not certain about the timestamp type at the moment). Unlike Pandas, which supports partitioning out of the box when writing a Parquet file from a dataframe with `DataFrame.to_parquet`, this library does not seem to offer built-in partitioning for my scenario.

It would be nice to have similar functionality. Are there plans to add partitioning support for the Parquet output type?
Meanwhile, I was examining the source code, and for testing purposes I made a basic change in my fork using the same approach as Pandas' parquet.py. I wrote a small script to test it, and it seems to work fine (at first glance, at least).