trinodb / aws-proxy

Proxy for S3
Apache License 2.0
8 stars 4 forks source link

Abstract internal design of the proxy away from the S3 API #37

Open vagaerg opened 5 months ago

vagaerg commented 5 months ago

This is a follow up to #20 , originated in https://github.com/trinodb/s3-proxy/pull/36#discussion_r1617224604

While we may not yet want to support non-S3-like endpoints for the underlying storage, we should make sure the internal design of the proxy is not tied to the design of the S3 API.

The proxy should receive requests in S3 format, parse them into some standard record format that conveys enough information as to what operation is being performed and upon what object(s)/bucket(s)/prefix(es) (if applicable), and pass this to a "proxy" interface that deals with sending it to the real storage system and parse the response back into a standard format (if applicable). The proxy should then use this information to reconstruct an S3-like response.

For now, there will probably only be implementations for S3-like systems, but the interface should be flexible enough that there's no leakage of S3 implementation details.

For instance, the TrinoS3ProxyResource (external facing interface) is OK to assume that all requests are coming in S3 format (as the proxy will always expose an S3 API externally), but it should not assume that it will be using an S3-like endpoint to forward requests to:

https://github.com/trinodb/s3-proxy/blob/77fa27e659bdbb0cd5e1cf18bcace3e1a4012bac/trino-s3-proxy/src/main/java/io/trino/s3/proxy/server/rest/TrinoS3ProxyResource.java#L40

TrinoS3ProxyClient also implements some logic to parse the incoming request: https://github.com/trinodb/s3-proxy/blob/77fa27e659bdbb0cd5e1cf18bcace3e1a4012bac/trino-s3-proxy/src/main/java/io/trino/s3/proxy/server/rest/TrinoS3ProxyClient.java#L76

This would probably be better centralised as mentioned above, such that TrinoS3ProxyClient gets a standard Java record and doesn't need to deal with URLs, HTTP methods, etc...

vagaerg commented 1 month ago

Some additional context: this would also be desirable for permissioning / authorisation decisions.

The current request record allows the authorisation system to make decisions based on the bucket & key a request is targeting quite easily. However, many requests may target the same bucket, key & HTTP method and yet have completely diferent meanings based on query parameters and headers.

For instance: GET with a bucket but without a key could be (just to name a few, there's more options):

While some of these are similar enough that they may not make too much of a difference from an authorisation standpoint (e.g. ListObjects and ListObjectsV2) others are fundamentally different operations (e.g., getting a bucket owner or its ACLs)