Open danielgafni opened 1 year ago
Reading up on this, and checking what we do in various read_* and scan_* methods:
Not sure os.PathLike is the right way forward. All it guarantees is that there is a dunder method __fspath__(), which returns a filesystem-like path, but that does not guarantee in any way that we can parse it like a file on a file system. fsspec, or other dependencies if fsspec does not cover it, are potentially needed.
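The point about os.PathLike can be illustrated with a minimal sketch. The RemoteObject class and the bucket URL are hypothetical and exist only for this illustration: the object satisfies the protocol, yet the local filesystem cannot open what it returns.

```python
import os


class RemoteObject:
    """Hypothetical path-like object pointing at remote storage."""

    def __init__(self, url: str) -> None:
        self.url = url

    def __fspath__(self) -> str:
        # os.PathLike only requires that this returns str or bytes.
        return self.url


# Hypothetical URL, used for illustration only.
remote = RemoteObject("s3://example-bucket/data.parquet")

# The object satisfies the os.PathLike protocol...
is_path_like = isinstance(remote, os.PathLike)
fspath = os.fspath(remote)

# ...but the builtin open() treats the result as a local path and fails.
try:
    with open(remote, "rb"):
        opened = True
except OSError:
    opened = False
```

So an os.PathLike annotation alone says nothing about whether the path is reachable without something like fsspec.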
UPath seems interesting as a way to improve the API (personally not a fan of the separate storage_options kwargs littered around), but it also seems, no offense, not widely used? Are there major libraries which have adopted UPath?
Important to distinguish between the eager (read + write) vs lazy API (scan + sink). For lazy, bespoke implementations are needed. We only support storage_options on scan_ipc and scan_parquet for example, whilst in addition to their eager equivalents there is also read_delta and read_csv supporting storage_options. Also, for other sources such as JSON there probably simply wasn't a need so far, but it could be added, as in most eager methods there are code paths where we simply load up the data in Python and then pass it down to the Rust code. This is something we could always do with fsspec in the loop.
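The eager code path described above (load the bytes in Python, then hand them to the parser) could be sketched roughly as follows. The read_eagerly helper is hypothetical and is not part of polars. The opener and parser are passed in so the sketch runs with the standard library alone; here the builtin open and json.load stand in for a remote opener and a polars reader.

```python
import io
import json
import os
import tempfile
from typing import Any, BinaryIO, Callable, ContextManager


def read_eagerly(
    path: str,
    open_fn: Callable[..., ContextManager[BinaryIO]],
    parse_fn: Callable[[io.BytesIO], Any],
    **storage_options: Any,
) -> Any:
    """Load all bytes in Python, then hand an in-memory buffer to the parser."""
    # storage_options are forwarded to the opener, not to the parser.
    with open_fn(path, "rb", **storage_options) as handle:
        buffer = io.BytesIO(handle.read())
    return parse_fn(buffer)


# Local demonstration with stdlib stand-ins.
with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, "data.json")
    with open(local_path, "w") as f:
        f.write('{"a": [1, 2, 3]}')
    result = read_eagerly(local_path, open, json.load)
```

Assuming fsspec and polars are installed, the same helper could presumably be called with open_fn=fsspec.open and parse_fn=pl.read_json, with the storage options passed as keyword arguments. This only covers eager reads; the lazy scan_* functions would still need their own implementation.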
To my knowledge it's just fsspec itself that's implemented UPath as well as dagster-io with the UPathIOManager.
Not sure os.PathLike is the right way forward. All it guarantees is that there is a dunder method __fspath__()
Oh, sure. Whatever is the correct type here. Probably just pathlib.Path then?
universal-pathlib is definitely not widely used. But:
a. pathlib for any filesystem is just a great idea.
b. The library is a tiny wrapper around the other fsspec filesystem-specific libs, which are used a lot. It's very unlikely there are going to be any problems with it as everything is well-tested in the upstream libs.
c. Even without universal-pathlib, we can still accept the general fsspec filesystem in the polars functions.
I found a workaround for fetching storage_options from UPath:
storage_options = path._kwargs.copy()
Edit: now it's available as path.storage_options
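A small helper can cover both the newer attribute and the older private one. This is a sketch only: path_and_storage_options and the FakeUPath stand-in are hypothetical names, and the stand-in exists just so the example runs without universal-pathlib installed.

```python
import pathlib
from typing import Any, Dict, Tuple


def path_and_storage_options(path: Any) -> Tuple[str, Dict[str, Any]]:
    """Split a path-like object into a string path and a storage_options dict."""
    options = getattr(path, "storage_options", None)
    if options is None:
        # Fallback for the older workaround via the private _kwargs attribute.
        options = getattr(path, "_kwargs", {})
    return str(path), dict(options)


class FakeUPath:
    """Stand-in for upath.UPath (hypothetical, for illustration only)."""

    def __init__(self, url: str, **storage_options: Any) -> None:
        self._url = url
        self.storage_options = storage_options

    def __str__(self) -> str:
        return self._url


remote_source, remote_options = path_and_storage_options(
    FakeUPath("s3://example-bucket/table.parquet", anon=True)
)

# A plain pathlib.Path has neither attribute, so the options are empty.
local_source, local_options = path_and_storage_options(pathlib.Path("table.parquet"))
```

The resulting pair could then be passed to a function that already accepts storage_options, for example scan_parquet(remote_source, storage_options=remote_options), assuming str() of a real UPath yields the full URL.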
I hit this issue today - so it's still relevant I think.
No harm in supporting UPath as an input to the polars io operations imo - should be a reasonably modest lift.
Problem description
Description
I'd like to be able to specify the source/target filesystem in different ways:
- UPath (actually just pathlib.Path objects where possible)
- filesystem: fsspec.AbstractFileSystem argument to reading/writing functions (some already have it)
Examples:
Rationale
While polars currently supports remote filesystems via the storage_options argument, it's not always the best option. Sometimes (my use case is writing Dagster pipelines) objects like UPath or the filesystem are already available to the user, but these objects don't provide the storage_options. Therefore, polars should be able to accept them directly.
- UPath: implementation should be easy, as UPath is a subclass of pathlib.Path, so the existing code will continue to work for it.
- filesystem: should be even easier as it's usually created internally. Should be easy to pass it to upstream code directly.
The current workarounds with no storage_options are:
Parquet
Writing is really easy:
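One way the writing workaround could look, as a sketch rather than the exact code from the issue: let the path object open a binary handle and write into it. write_parquet_to is a hypothetical helper, and FakeFrame is a stand-in for a polars DataFrame so the example runs without polars. The only assumption about the real objects is that the path has open("wb") and the frame has write_parquet accepting a file-like object.

```python
import pathlib
import tempfile
from typing import Any


def write_parquet_to(df: Any, path: Any) -> None:
    """Write df through a binary handle opened by the path object itself."""
    # The path object (pathlib.Path here, UPath in the remote case) decides
    # which filesystem the handle belongs to.
    with path.open("wb") as handle:
        df.write_parquet(handle)


class FakeFrame:
    """Stand-in for a DataFrame (hypothetical, for illustration only)."""

    def write_parquet(self, file: Any) -> None:
        file.write(b"PAR1")


with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / "out.parquet"
    write_parquet_to(FakeFrame(), target)
    written = target.read_bytes()
```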
Reading is a little more tricky, as the source path may be a directory, but also possible:
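The directory case could be handled roughly like this. It is a sketch using only the pathlib-style API (is_dir, rglob, open), which UPath is assumed to mirror, and it is not necessarily the approach the issue author used. parquet_files_under and read_each are hypothetical helpers, and the reader is passed in so the example runs with the standard library alone.

```python
import pathlib
import tempfile
from typing import Any, Callable, List


def parquet_files_under(path: Any) -> List[Any]:
    """Return the path itself, or every *.parquet file below a directory."""
    if path.is_dir():
        # Sort by string form so the order is deterministic.
        return sorted(path.rglob("*.parquet"), key=str)
    return [path]


def read_each(path: Any, read_fn: Callable[[Any], Any]) -> List[Any]:
    """Open every file in binary mode and pass the handle to read_fn."""
    results = []
    for file_path in parquet_files_under(path):
        with file_path.open("rb") as handle:
            results.append(read_fn(handle))
    return results


# Local demonstration: raw bytes stand in for real Parquet content.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "nested").mkdir()
    (root / "a.parquet").write_bytes(b"first")
    (root / "nested" / "b.parquet").write_bytes(b"second")
    (root / "notes.txt").write_bytes(b"ignored")

    chunks = read_each(root, lambda handle: handle.read())
    single = read_each(root / "a.parquet", lambda handle: handle.read())
```

With polars installed, read_fn could presumably be pl.read_parquet, with the per-file frames combined afterwards via pl.concat.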
Delta
No workarounds.
Writing - DataFrame.write_delta doesn't work with UPath and doesn't have a filesystem argument. deltalake.write.write_deltalake has the filesystem argument, so it should be really easy to add!
Reading - seems like no workaround available currently, as DeltaTable (which is used for reading) doesn't have a filesystem option. Should probably raise an issue upstream?
I bumped into these issues when working on dagster-polars. dagster has an abstraction called IOManager which handles loading/saving data for the user. There is a UPathIOManager which is built with upath.UPath, which is used to implement most of the other IOManagers which work with files. It's currently impossible to adopt it for DeltaLake, although it works for Parquet with some workarounds.