jgaskins / aws

AWS Client for the Crystal programming language
MIT License

URL encoding for url-unsafe characters? #3

Open robacarp opened 6 months ago

robacarp commented 6 months ago

There are some characters which need to be encoded in S3 URIs. By a rather unfortunate accident, I'm currently trying to migrate several thousand objects which were written to S3 with a # in the key. This is one of the fuzzy places where S3 implementations all kind of disagree on what should be done -- some of them outright reject that character as invalid in a key. AWS discourages it but doesn't seem to actually prevent you from doing anything with it. (Though I can't get aws cp to do anything helpful with paths which embed #...)

I don't really know if it's appropriate to encode these things as part of the library, or to just encourage folks to encode their own strings better. What do you think?

robacarp commented 6 months ago

It seems like boto runs a URL escape:

    aws --endpoint "http://localhost:3002" s3 ls s3://podb-development/development/broadcast#335692/transcription.vtt

sends:

    GET /podb-development/development/broadcast%23335692/transcription.vtt

However, when I try to escape it myself, the signature becomes invalid. I suspect that's because the signing library is re-encoding the path, so the signed path contains %2523 instead of %23 (urlencode('%') => '%25').
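The suspected double-encoding is easy to reproduce as a sketch (Ruby's stdlib `CGI.escape` used purely for illustration, not the actual signing code): escaping an already-escaped segment encodes the `%` itself.

```ruby
require 'cgi'

# Escaping '#' once produces the intended percent-encoding.
escaped_once = CGI.escape("#")
puts escaped_once            # => "%23"

# Escaping the already-escaped value encodes the '%' again,
# which is what a signer re-encoding a pre-escaped path would do.
escaped_twice = CGI.escape(escaped_once)
puts escaped_twice           # => "%2523"
```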

jgaskins commented 6 months ago

Oooh, this is an interesting question. I'm actually surprised any server implementations care about # at all. I was under the impression that browsers don't send the fragment of the URL to servers, so I wouldn't assume they'd need to work around it.

I want to say it's a good idea to URL-encode it in the shard. The fact that the object key is the URL path is an implementation detail that I don't think makes sense to expose beyond the library's boundary.

My only concern: could this be why there isn't consensus on what to do about these characters in keys?

robacarp commented 6 months ago

The AWS S3 docs specify which characters are safe, which might require special handling, and which should be avoided. There isn't actually a blacklist of key characters. There are even silly statements like this:

Objects with a prefix of "./" must be uploaded or downloaded with the AWS Command Line Interface (AWS CLI), AWS SDKs, or REST API. You cannot use the Amazon S3 console.

So I think the behind-the-scenes is likely to be a little muddy.

Their recommendations on encoding:

The following characters in a key name might require additional code handling and likely need to be URL encoded or referenced as HEX. Some of these are non-printable characters that your browser might not handle, which also requires special handling.

It's possible to interact with the S3 API via JSON, XML, etc., not just REST-style HTTP. The documentation specifies how that can be done, including how to escape characters which would cause trouble in XML.

There is a lot of variance in how S3-like backends implement things, but I think it's probably wise to stick as close to Amazon's implementation as possible.


As an aside, I came across this tidbit at the top of that page, which is a little telling about some of the logic they have going on behind the scenes:

Object key names with the value "soap" aren't supported for virtual-hosted-style requests. For object key name values where "soap" is used, a path-style URL must be used instead.

robacarp commented 6 months ago

The Ruby client has this peculiar set of escape methods, which does something a little like Crystal's URI.escape_path:

    def uri_escape(string)
      CGI.escape(string.to_s.encode('UTF-8')).gsub('+', '%20').gsub('%7E', '~')
    end

    def uri_path_escape(path)
      path.gsub(/[^\/]+/) { |part| uri_escape(part) }
    end

The same logic is repeated in the SigV4 signer.
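As a sketch of what those two methods do to the key from earlier in this thread (self-contained copy, assuming only Ruby's stdlib CGI):

```ruby
require 'cgi'

# Form-style escape, then undo the two form-isms: '+' for space
# becomes '%20', and '%7E' is restored to a literal '~'.
def uri_escape(string)
  CGI.escape(string.to_s.encode('UTF-8')).gsub('+', '%20').gsub('%7E', '~')
end

# Escape each path segment individually, leaving '/' separators intact.
def uri_path_escape(path)
  path.gsub(/[^\/]+/) { |part| uri_escape(part) }
end

puts uri_path_escape("development/broadcast#335692/transcription.vtt")
# => "development/broadcast%23335692/transcription.vtt"
```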