jgaskins / aws

AWS Client for the Crystal programming language
MIT License

URL encoding for url-unsafe characters? #3

Open robacarp opened 6 months ago

robacarp commented 6 months ago

There are some characters which need to be encoded in S3 URIs. By a rather unfortunate accident, I'm currently trying to migrate several thousand objects which were written to S3 with a # in the key. This is one of the fuzzy places where S3 implementations all kind of disagree on what should be done -- some of them outright reject that character as invalid in a key. AWS discourages it but doesn't seem to actually prevent you from doing anything with it. (Though I can't get aws cp to do anything helpful with paths which embed #...)

I don't really know if it's appropriate to encode these things as part of the library, or to just encourage folks to encode their own strings better. What do you think?

robacarp commented 6 months ago

It seems like boto runs a URL escape:

    aws --endpoint "http://localhost:3002" s3 ls s3://podb-development/development/broadcast#335692/transcription.vtt

sends:

    GET /podb-development/development/broadcast%23335692/transcription.vtt

However, when I try to escape it myself, the signature becomes invalid. I suspect that's because the signing library is re-encoding the path, so the signed path contains %2523 instead of %23 (urlencode('%') => '%25').
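The suspected double-encoding is easy to reproduce as a sketch (Ruby's stdlib `CGI.escape` used purely for illustration, not the actual signing code): escaping an already-escaped segment encodes the `%` itself.

```ruby
require 'cgi'

# Escaping '#' once produces the intended percent-encoding.
escaped_once = CGI.escape("#")
puts escaped_once            # => "%23"

# Escaping the already-escaped value encodes the '%' again,
# which is what a signer re-encoding a pre-escaped path would do.
escaped_twice = CGI.escape(escaped_once)
puts escaped_twice           # => "%2523"
```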

jgaskins commented 6 months ago

Oooh, this is an interesting question. I'm actually surprised any server implementations care about # at all. I was under the impression that browsers don't send the fragment of the URL to servers, so I wouldn't assume they'd need to work around it.

I want to say it's a good idea to URL-encode it in the shard. The fact that the object key is the URL path is an implementation detail that I don't think makes sense to expose beyond the library's boundary.

My only concern: could this be why there isn't consensus on what to do about these characters in keys?

robacarp commented 6 months ago

The AWS S3 docs specify which characters are safe, which might require special handling, and which should be avoided. There isn't actually a blacklist of key characters. There are even silly statements like this:

Objects with a prefix of "./" must be uploaded or downloaded with the AWS Command Line Interface (AWS CLI), AWS SDKs, or REST API. You cannot use the Amazon S3 console.

So I think the behind-the-scenes is likely to be a little muddy.

Their recommendations on encoding:

The following characters in a key name might require additional code handling and likely need to be URL encoded or referenced as HEX. Some of these are non-printable characters that your browser might not handle, which also requires special handling.

It's possible to interact with the S3 API via JSON, XML, etc., not just REST-style HTTP. The documentation specifies how that can be done, including how to escape characters which would cause trouble in XML.

There is a lot of variance in how S3-like backends implement things, but I think it's probably wise to stick as close to Amazon's implementation as possible.


As an aside, I came across this tidbit at the top of that page, which is a little telling about some of the logic they have going on behind the scenes:

Object key names with the value "soap" aren't supported for virtual-hosted-style requests. For object key name values where "soap" is used, a path-style URL must be used instead.

robacarp commented 6 months ago

The Ruby client has this peculiar set of escape methods, which does something a little like Crystal's URI.escape_path:

    def uri_escape(string)
      CGI.escape(string.to_s.encode('UTF-8')).gsub('+', '%20').gsub('%7E', '~')
    end

    def uri_path_escape(path)
      path.gsub(/[^\/]+/) { |part| uri_escape(part) }
    end

The same logic is repeated in the SigV4 signer.
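As a sketch of what those two methods do to the key from earlier in this thread (self-contained copy, assuming only Ruby's stdlib CGI):

```ruby
require 'cgi'

# Form-style escape, then undo the two form-isms: '+' for space
# becomes '%20', and '%7E' is restored to a literal '~'.
def uri_escape(string)
  CGI.escape(string.to_s.encode('UTF-8')).gsub('+', '%20').gsub('%7E', '~')
end

# Escape each path segment individually, leaving '/' separators intact.
def uri_path_escape(path)
  path.gsub(/[^\/]+/) { |part| uri_escape(part) }
end

puts uri_path_escape("development/broadcast#335692/transcription.vtt")
# => "development/broadcast%23335692/transcription.vtt"
```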