liormizr / s3path

s3path is a pathlib extension for AWS S3 Service
Apache License 2.0
206 stars 39 forks source link

Heads up: pathlib.VirtualPath may be coming soon #140

Open barneygale opened 1 year ago

barneygale commented 1 year ago

I've logged a CPython PR that adds a private pathlib._VirtualPath class:

https://github.com/python/cpython/pull/106337

If/when that lands, it's not much more work to drop the underscore and introduce pathlib.VirtualPath to the world. This would be Python 3.13 at the earliest.

Would it be suitable for use in s3path? It would be great to hear your feedback!

liormizr commented 1 year ago

Thanks @barneygale for letting us know As I looking at your PR with the new _PathBase implementation it doesn't really solve the issues that we have with pathlib extensions & integrations.

Instead of basing our Accessor layer in Accessor class (the old design that was removed from the pathlib few versions ago) basing S3Path from Path or from _PathBase it's not really matter from our point of view. When it will be in the python version we will do the change.

The issue that we do have is with how other packages react to path objects.

for example Pandas In Pandas, this is the api for read_csv (reading a csv file). if you are working with strings:

# for local path
pandas.read_csv('/path/to/table.csv', ...)
# or if you want to work with uri representation
pandas.read_csv('file://localhost/path/to/table.csv', ...)

And when you want S3 you can do like this:

# for S3 uri representation
pandas.read_csv('s3://path/to/table.csv', ...)

Now when you are working with Pandas and Path objects. Pandas are using the __fspath__ method to understand what to do. This brings two main issues.

  1. fspath from the spec doesn't return uri, it returns the file system path representation of the object So:
    
    from pathlib import Path
    from s3path import S3Path

local_path = Path('/path/to/table.csv') pandas.read_csv(local_path, ...) # it's the same like passing '/path/to/table.csv'

but in S3Path we braike

s3_path = S3Path('/path/to/table.csv') pandas.read_csv(s3_path, ...) # it's the same like passing '/path/to/table.csv' instead of s3://path/to/table.csv


2. the second issue is that there is multiple ways to define uri
When we are asking from Path the uri it's different between what we are giving and when Panda expect from us 
``` python
from pathlib import Path

local_path = Path('/path/to/table.csv')

pandas.read_csv(local_path.as_uri(), ...) # This will failed. file:///path/to/table.csv is not what Pandas expect
# not the same like:
pandas.read_csv('file://localhost/path/to/table.csv', ...) # This will failed. file:///path/to/table.csv is not what Pandas expect

My point is not that Pandas have issues, Pandas was just one example.

I think that we still have to think about the general purpose and design of methods like __str__, __fspath__, __repr__. Thinking about moving us for more uri's representation and how to give packages the tools / guidns to relay on them.

make sense? What do you think about that?

barneygale commented 1 year ago

Thanks for the feedback, that makes perfect sense.

My hope is that APIs like pandas.read_csv() will start accepting PathBase objects, roughly like this:

def read_csv(path: pathlib.PathBase | os.PathLike):
    if not isinstance(path, pathlib.PathBase):
        path = pathlib.Path(path)
    with path.open() as f:
        ...

It will take a while for third-party packages to begin accepting PathBase, similar to how it took a while for them to begin accepting os.PathLike.

Does that make sense?

On __fspath__(): the implementation of PathBase does not define __fspath__(), because it would result in nonsense like open(ZipPath('README')) opening a file called 'README' in the current directory -- very surprising!

On URIs: the implementation of PathBase.as_uri() raises UnsupportedOperation, because the path type might not use file: URIs. You're right that file: URIs are implemented differently across different packages. I'm not sure if leaning more heavily into URIs is the right way to go - it would involve adding, I think, a new __uri__() magic method, and probably a registry of scheme handlers.

merwok commented 1 year ago

with open(path) as f:

Oh, I wasn’t expecting that for Path classes, only methods! How can the builtin open work with a TarPath for example?

liormizr commented 1 year ago

Make perfect sense @barneygale Thanks for your answer

Regarding _PathBase we will do the move when it will be available Will you keep the privet name "_PathBase" or will it be changed to a public class?

barneygale commented 1 year ago

with open(path) as f:

Oh, I wasn’t expecting that for Path classes, only methods! How can the builtin open work with a TarPath for example?

Shit sorry, horrible typo on my part. Will correct.

barneygale commented 1 year ago

Regarding _PathBase we will do the move when it will be available Will you keep the privet name "_PathBase" or will it be changed to a public class?

It'll become a public class, hopefully in time for Python 3.13 (next year)

barneygale commented 7 months ago

Hello - I've published CPython's private PathBase class in a PyPI package: https://pypi.org/project/pathlib-abc/0.1.0/. No docs yet but I should have them ready soon.

If the PyPI package succeeds and matures, I'm hopeful we can make PathBase public in Python itself. Hope that's helpful. Grateful for any feedback you might have!