liormizr / s3path

s3path is a pathlib extension for AWS S3 Service
Apache License 2.0
206 stars 39 forks source link

Feature request: Support for S3 version IDs #126

Closed nlangellier closed 1 year ago

nlangellier commented 1 year ago

Hi @liormizr It would be nice to allow accessing specific versions of S3 objects for buckets that have enabled versioning.

This could be accomplished by adding an optional version_id=None argument to PureS3Path.from_uri and PureS3Path.from_bucket_key as well as defining S3Path.__new__ like the following:

def __new__(cls, *args, version_id=None, **kwargs):
    self = super().__new__(cls, *args, **kwargs)
    self.version_id = version_id
    return self

Then _S3Accessor.open would update the transport_params dict to include path.version_id if it is not None.

I already have a working prototype ready for pull request if you approve of the idea.

What do you think?

liormizr commented 1 year ago

Hi @nlangellier

Thanks for the feature request

Basically I don't have an issue adding S3 features as long that we are not changing the basic pathlib api and keeping our api as similar as posible

can you show me here what will be the change in the usage of s3path?

I'll give you example: if we can add if versions are enable in the s3path.StatResult object that we get from S3Path.state And add a feature to call the specific version without changing the api that can work We need to check the syntax but maybe something like this:

path = S3Path('/<bucket>/<key>?VersionID=<version_id>')

Does that make sense for you?

nlangellier commented 1 year ago

What I was suggesting would look like one of the following

path = S3Path('/<bucket>/<key>', version_id='<version_id>')
path = S3Path.from_uri('s3://<bucket>/<key>', version_id='<version_id>')
path = S3Path.from_bucket_key(bucket='<bucket>', key='<key>', version_id='<version_id>')

This would only be an addition on top of the current API so the following would still work as usual:

path = S3Path('/<bucket>/<key>')
path = S3Path.from_uri('s3://<bucket>/<key>')
path = S3Path.from_bucket_key(bucket='<bucket>', key='<key>')

But you are right that this would constitute a deviation from the pathlib API.

Are you implying with your suggestion of

path = S3Path('/<bucket>/<key>?VersionID=<version_id>')

that we would parse the input S3 path string in a similar way that query params are parsed in a URL? If so I could certainly make that work. I would call out that '?' is a valid S3 object name character, but it is specifically called out by AWS as a character that might need special treatment, so perhaps it's OK. Thoughts?

liormizr commented 1 year ago

Thats exactly what I mean yes We need to do small research, and check if we can add "?" like URL parsing And what happens when adding "?" to a key - how s3path/boto3 react to it. If the answer from this research is clear and explicit behaviour, we can go forward. If not, need to think on a different approach.

nlangellier commented 1 year ago

In the class constructors it should be a simple matter to extract out the ?VersionId=<version id> part of the provided string before calling the parent class constructors from pathlib. So I think there should be no problem in making this compatible with s3path and boto3. The only thing we need to decide on is if we are OK with the edge case where someone has an S3 object whose bucket or key contains the string "?VersionId=". This would be an edge case that would fail with this implementation. However, as mentioned in the link in my previous comment, AWS lists the ? and = characters as special characters that may require special handling if they exist in key names. So maybe we are OK with this unlikely event?

If we are not OK with this edge case, I can think of some additional options for implementation:

  1. We could use my original idea of having an optional version=None argument in the constructors, with the change that this optional argument would only exist in PureS3Path.from_uri and PureS3Path.from_bucket_key since these methods don't have analogs in the pathlib API and thus already are a deviation from the pathlib API. i.e. versioned S3 objects would not be accessible through a direct call to S3Path like S3Path("/<bucket>/<key>").
  2. We could add some subset of he following constructors to PureS3Path:
    • from_version e.g. S3Path.from_version("/<bucket>/<key>", version_id="<version id>")
    • from_uri_version e.g. S3Path.from_uri_version("s3://<bucket>/<key>", version_id="<version id>")
    • from_bucket_key_version e.g. S3Path.from_bucket_key_version(bucket="<bucket>", key="<key>", version_id="<version id>")
  3. We could create a class VersionedS3Path that inherits from S3Path and whose constructor takes in a required version_id parameter. e.g. VersionedS3Path("/<bucket>/<key>", version_id="<version id>")

What do you think about the ?VersionId= edge case and the proposed alternate implementations?

nlangellier commented 1 year ago

@liormizr When implementing the "?VersionId=" delimiter idea, it occurred to me that pathlib.Path has two methods of creating instances. pathlib.Path("a/b/c.d") produces the same object as pathlib.Path("a", "b", "c.d"). With the former, attaching ?VersionId=<version id> to the end of the path string seems natural. But with the latter, it seems rather awkward. My opinion is that with this awkward nature and combined with the fact that "/<bucket>/<key>?VersionId=<version id>" has an edge case when either <bucket> or <key> actually contain the string "?VersionId=", that we should not use the "?VersionId=" delimiter idea.

I would personally vote for the third option above (i.e. class VersionedS3Path(S3Path):). We already have precedent to extend the pathlib API in certain cases (e.g. PureS3Path.from_uri and PureS3Path.from_bucket_key) and this would just be one more such extension. As suggested in your earlier comment, I would make sure that VersionedS3Path.stat would behave as expected. WDYT?

liormizr commented 1 year ago

Hi @nlangellier First thing I want to thank you for the interesting research that you did for this feature :-)

I agree with you, I think that for the first step to create a VersionedS3Path is the best approach. It's clean and in the future it will be more flexible to add it to s3path if we will see the need.

So what I think that is the next steps:

What do you think?

nlangellier commented 1 year ago

Perfect. I will put up a PR in the coming days fulfilling each of the bullet points you listed.

nlangellier commented 1 year ago

Hi @liormizr #139 was just opened to address this feature request (co-authored between @chnpenny and myself) . Please have a look and review when you get a chance.

liormizr commented 1 year ago

We merged the PR Tomorrow I'll upload a new version and update in this issue

liormizr commented 1 year ago

Deployed in version 0.5.0