pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Alternative filesystem configuration for reading/writing functions #9436

Open danielgafni opened 1 year ago

danielgafni commented 1 year ago

Problem description

Description

I'd like to be able to specify the source/target filesystem in different ways:

Examples:

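from upath import UPath

path = UPath("s3://my-bucket/my-file.parquet")
filesystem = path.fs  # the fsspec filesystem behind the UPath

# Proposed: accept a UPath directly
df.write_parquet(path)
# Proposed: accept an fsspec filesystem via a new `filesystem` argument
df.write_parquet("s3://my-bucket/my-file.parquet", filesystem=filesystem)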

Rationale

While polars currently supports remote filesystems via the storage_options argument, it's not always the best option. Sometimes (my use case is writing Dagster pipelines) objects like UPath or the filesystem are already available to the user, but these objects don't provide the storage_options. Therefore, polars should be able to accept them directly.

The current workarounds with no storage_options are:

Parquet

Writing is really easy:

with UPath(...).open("wb") as file:
    df.write_parquet(file)
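
The writing workaround above can be wrapped in a small helper that approximates the requested `filesystem` argument. This is only a sketch: `write_parquet_with_filesystem` and `SupportsOpen` are hypothetical names, not polars API, and the helper assumes the filesystem exposes an fsspec-style `open(path, mode)` method.

```python
from typing import Any, BinaryIO, Protocol


class SupportsOpen(Protocol):
    """Anything with an fsspec-style ``open(path, mode)`` method."""

    def open(self, path: str, mode: str = "rb") -> BinaryIO: ...


def write_parquet_with_filesystem(df: Any, path: str, filesystem: SupportsOpen) -> None:
    """Hypothetical helper: write ``df`` to ``path`` on ``filesystem``.

    ``df`` only needs a ``write_parquet(file)`` method that accepts an
    open binary file object, as in the workaround above.
    """
    # Open the target through the filesystem object, then hand the
    # file handle to the frame instead of a path string.
    with filesystem.open(path, "wb") as file:
        df.write_parquet(file)
```

With a UPath at hand, the call would presumably look like `write_parquet_with_filesystem(df, path.path, path.fs)`.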

Reading is a little trickier, since the source path may be a directory, but it is also possible:

import polars as pl
import pyarrow.dataset as ds
path = UPath(...)
df = pl.scan_pyarrow_dataset(ds.dataset(str(path), filesystem=path.fs))  # yay! has the `filesystem` argument!

Delta

No workarounds.

Writing - DataFrame.write_delta doesn't work with UPath and doesn't have a filesystem argument. deltalake.write.write_deltalake has the filesystem argument, so it should be really easy to add!

Reading - there seems to be no workaround available currently, as DeltaTable (which is used for reading) doesn't have a filesystem option. Should probably raise an issue upstream?


I bumped into these issues when working on dagster-polars. Dagster has an abstraction called IOManager which handles loading/saving data for the user. There is a UPathIOManager, built on upath.UPath, which is used to implement most of the other IOManagers that work with files. It's currently impossible to adapt it for DeltaLake, although it works for Parquet with some workarounds.

zundertj commented 1 year ago

Reading up on this, and checking what we do in various read_* and scan_* methods:

  1. Not sure os.PathLike is the right way forward. All it guarantees is that there is a dunder method __fspath__(), which returns a filesystem-like path, but that does not guarantee in any way that we can parse it like a file on a file system. fsspec, or other dependencies if fsspec does not cover it, would potentially be needed.

  2. UPath seems interesting as a way to improve the API (personally not a fan of the separate storage_options kwargs littered around), but it also seems, no offense, not widely used? Are there major libraries which have adopted UPath?

  3. Important to distinguish between the eager (read + write) vs lazy api (scan + sink). For lazy, bespoke implementations are needed. For example, we only support storage_options on scan_ipc and scan_parquet, whilst in addition to their eager equivalents, read_delta and read_csv also support storage_options. For other sources such as json, there probably simply wasn't a need so far, but support could be added: in most eager methods there are code paths where we simply load the data in Python and then pass it down to the Rust code. This is something we could always do with fsspec in the loop.
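
To illustrate point 1, here is a minimal standard-library sketch. `RemoteObject` is a hypothetical class invented for this example; it satisfies `os.PathLike` even though the string it returns is not a usable local path.

```python
import os


class RemoteObject:
    """Hypothetical path-like wrapper around a remote URL."""

    def __init__(self, url: str) -> None:
        self.url = url

    def __fspath__(self) -> str:
        # os.PathLike only requires returning str or bytes; nothing says
        # the result must refer to a file on the local filesystem.
        return self.url


obj = RemoteObject("s3://my-bucket/my-file.parquet")

print(isinstance(obj, os.PathLike))  # the protocol check passes
print(os.fspath(obj))                # the raw URL string comes back
print(os.path.exists(obj))           # but it is not a local file
```

So accepting `os.PathLike` alone tells a reader function nothing about which backend is needed to open the path.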

jordandakota commented 1 year ago

To my knowledge, only fsspec itself has implemented UPath, along with dagster-io and its UPathIOManager.

danielgafni commented 1 year ago

Not sure os.PathLike is the right way forward. All it guarantees is that there is a dunder method __fspath__()

Oh, sure. Whatever is the correct type here. Probably just pathlib.Path then?

universal-pathlib is definitely not widely used. But:

  1. I think we can change that : )
     a. It's very useful. I think pathlib for any filesystem is just a great idea.
     b. The library is a tiny wrapper around the other fsspec filesystem-specific libs, which are used a lot. It's very unlikely there are going to be any problems with it as everything is well-tested in the upstream libs.
  2. Even if we don't want to support universal-pathlib, we still can accept the general fsspec filesystem in the polars functions.

danielgafni commented 1 year ago

I found a workaround for fetching storage_options from UPath:

storage_options = path._kwargs.copy()

Edit: now it's available as path.storage_options
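
Both variants of this workaround can be combined in one small helper. This is a sketch only: `get_storage_options` is a hypothetical name, and it assumes the object exposes either the public `storage_options` attribute or the older private `_kwargs` mentioned above.

```python
from typing import Any


def get_storage_options(path: Any) -> dict[str, Any]:
    """Hypothetical helper: copy the fsspec options carried by a UPath-like object."""
    # Prefer the public attribute mentioned in the edit above.
    options = getattr(path, "storage_options", None)
    if options is None:
        # Fall back to the private attribute from the original workaround.
        options = getattr(path, "_kwargs", {})
    # Return a copy so callers cannot mutate the path's own options.
    return dict(options)
```

The result could then be passed along with the path string, e.g. `pl.read_parquet(str(path), storage_options=get_storage_options(path))`.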

mkleinbort-ic commented 7 months ago

I hit this issue today - so it's still relevant I think.

No harm in supporting UPath as an input to the polars io operations imo - should be a reasonably modest lift.