jrochkind / faster_s3_url

Optimized generation of public and presigned AWS S3 GET URLs in ruby faster
MIT License
84 stars 7 forks source link

Add `endpoint` option to `Builder` #10

Open BenKanouse opened 1 week ago

BenKanouse commented 1 week ago

https://github.com/jrochkind/faster_s3_url/issues/9

Currently the Builder's initializer accepts a host param which assumes https. I like to use minio for local development and it isn't ideal to run it with https.

The normal Aws::S3::Client gem accepts an :endpoint option to by-pass this issue.

Here is a section of the docs that explains this option:

:endpoint (String, URI::HTTPS, URI::HTTP) — Normally you should not configure the :endpoint option directly. This is normally constructed from the :region option. Configuring :endpoint is normally reserved for connecting to test or custom endpoints. The endpoint should be a URI formatted like:
'http://example.com'
'https://example.com'
'http://example.com:123'

https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/S3/Client.html

jrochkind commented 20 hours ago

Thank you for this, and for the good description!

I'm sorry I've been very busy in other areas last week and this, but this is on my radar. I will try to look at it today, if not, know that I haven't forgotten you, but if you don't hear from me for a bit, please feel free to ping again on this issue!

jrochkind commented 20 hours ago

OK, so host -- you are right that when I implemented/documented it I hadn't totally thought it ought and it didn't quite work in that way.

But I have ended up using it for when you have a CloudFront distribution (or any other CDN) on top of S3. For public unsigned urls, you just need to change the hostname to be your CDN hostname. And this is a convenient tool to do it, in a speedy way (much speedier than AWS SDK), applying appropriate (a bit idiosyncratic) S3 escaping rules for keys, etc.

So I think we do need to keep the host argument there, working as it did -- at least for unsigned URLs. Perhaps host and endpoint are mutually exclusive, and if you specify both you get an ArgumentException? Or we just doc which takes priority.

jrochkind commented 19 hours ago

OK thanks for this!

Splitting it out into a separate class is nice and readable code organization wise. (I'd probably call the class BaseURI instead of just URI).

BUT. The whole point of this thing is performance. We need to make sure splitting it into a different class, or these changes in general, don't harm performance significanty -- especially in cases where this new feature is NOT being used.

Would you be interested in trying to run the in-repo benchmarking stuff against this new one -- maybe find a way to run the old unchanged code against the new code -- to see how it effects performance?

Thank you!

jrochkind commented 19 hours ago

Also add a CHANGELOG.md entry in PR please, with a URL to this PR!

BenKanouse commented 19 hours ago

Thanks for getting back to me! I can get back to this early next week.

It sounds like I have the following TODOs:

I am still a little confusing about the host feature. Does that mean when you use it you do specify the bucket name in the host string you provide? If so that makes a lot of sense, and I could update the specs to match that pattern.

jrochkind commented 18 hours ago

Is the difference between endpoint and host just that endpoint allows you to include a port too? I am having trouble understanding what's up here -- can you give me illustrative examples of what you would pass in as endpoint and where you would get it?

jrochkind commented 18 hours ago

Hey if I understand what endpoint is doing, I think this could and shoudl actually be a lot simpler.

Endpoint is just to allow you to specify http instead of https, and a port in addition to a host?

This code can be much simpler. If Endpoint is specified, just use the string given to construct the url, no?

Places in the code that have "https://" + self.host + path just need to be changed to have logic like:

if self.endpoint
     self.endpoint + path
else
    "https://" + self.host + path
 end

Even better, if on initialization a host is given and not an endpoint, just SET the endpoint to "https://#{host}", and then all subsequent logic can just be in terms of endpoint -- host is just a shortcut for setting endpoint that assumes https and no port.

I think this can just be a few lines of change, and doesn't need a separate class, or instantiating a stdlib URI object -- things I was worried about their effects on performance.

My use of host. I have an S3 Bucket. So something in the S3 bucket might be at the public URL "https://#my-bucket.s3.amazonaws.com/somepath/something.jpg". No signature, it's a public URL.

But i have an AWS Cloudfront CDN in front of it, so to access it through AWS Cloudfront I actually need to access it at https://d987adfa07df.cloudfront.net/somepath/something.jpg, so I set host to d987adfa07df.cloudfront.net.

BenKanouse commented 16 hours ago

I started down that expect path, and I ran into 2 problems.

  1. When endpoint it an IP address AWS has to form the path differently by adding the bucket. Which I try and point in this spec. I might be able to make that spec clearer by adding a starts_with assertion. So the path changes as well as the host/endpoint.
  2. The canonical_headers needs to not include the http or https. So when I have an endpoint param I need to be able to strip the scheme off of it.

Both the path and the canonical_headers are in the string_to_sign so it is important that those variables are exactly correct for the signature. So it makes it all it little more complicated than just changing the resulting url.

I do think you have a great idea about using the host to set an endpoint, which might simplify the logic! I didn't fully understand that host variable before, I can look into this next week. Thanks!