
Saving parquet to Google Cloud Storage with `df.write_parquet()` #14630


nagomiso commented 5 months ago

Description

I am using Polars with Python. When I attempted to save a DataFrame to Google Cloud Storage by passing a Google Cloud Storage URI to df.write_parquet(), a FileNotFoundError was raised and the write failed.

In [1]: import polars as pl

In [2]: df = pl.DataFrame(
   ...:     {
   ...:        "foo": [1, 2, 3, 4, 5],
   ...:        "bar": [6, 7, 8, 9, 10],
   ...:        "ham": ["a", "b", "c", "d", "e"],
   ...:     }
   ...: )

In [3]: df.write_parquet("gs://my-bucket/test/foo.parquet.zstd")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-4-0985ea6f108d> in ?()
----> 1 df.write_parquet("gs://my-bucket/test/foo.parquet.zstd")

~/polars-test/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py in ?(self, file, compression, compression_level, statistics, row_group_size, data_page_size, use_pyarrow, pyarrow_options)
   3507                     **(pyarrow_options or {}),
   3508                 )
   3509 
   3510         else:
-> 3511             self._df.write_parquet(
   3512                 file,
   3513                 compression,
   3514                 compression_level,

FileNotFoundError: No such file or directory (os error 2)
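
For what it's worth, the error presumably comes from the native writer treating the URI as an ordinary local path, much as a plain open() would (a minimal sketch reproducing the same os error 2):

# The "gs://..." string is interpreted as a relative local path, so the
# OS reports the missing "gs:" directory with os error 2, as above.
with open("gs://my-bucket/test/foo.parquet.zstd", "wb") as f:
    pass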

Since pl.read_parquet() can load files directly from Google Cloud Storage, I would like df.write_parquet() to be able to save files to Google Cloud Storage in the same way.
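
For comparison, this read-side call works today with gcsfs installed (the bucket path is a placeholder; credentials are resolved from the environment):

import polars as pl

# Reading straight from GCS already works, authenticating e.g. via
# the GOOGLE_APPLICATION_CREDENTIALS environment variable.
df = pl.read_parquet("gs://my-bucket/test/foo.parquet.zstd")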

Environment

The versions of the dependencies in my environment that seem relevant are as follows:

ritchie46 commented 5 months ago

I agree that this is a great feature. We should add this natively on the Rust side.

Mmoncadaisla commented 4 months ago

I'm not very experienced with Polars, and I understand this is already known and that you'd like a more straightforward interface (apologies for the noise if so). Just in case it's useful to anyone: it is already possible to write Parquet files directly to GCS (as well as to other storage providers) by handing df.write_parquet() an open file object:

import polars as pl
import gcsfs

# Assuming `df` is your Polars DataFrame and the
# GOOGLE_APPLICATION_CREDENTIALS env variable is correctly set.
# df = pl.read_parquet("file_path")

fs = gcsfs.GCSFileSystem()

# Define your GCS bucket and file path
destination = "gs://bucket/folder/file.parquet"

# Write the DataFrame to a Parquet file directly in GCS
with fs.open(destination, mode="wb") as f:
    df.write_parquet(f)
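
Because df.write_parquet() accepts any writable file-like object, the same pattern should generalize to other providers through fsspec (a sketch; the URIs are placeholders, and the matching filesystem package such as gcsfs or s3fs must be installed):

import fsspec
import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3]})

# fsspec dispatches on the URI scheme, so one code path covers
# GCS, S3, and other providers (URIs below are placeholders).
for destination in ("gs://bucket/file.parquet", "s3://bucket/file.parquet"):
    with fsspec.open(destination, mode="wb") as f:
        df.write_parquet(f)

You can then read a file back with pl.read_parquet(destination) to verify the round trip.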