Remove invalid UTF-8 characters

pkang commented 9 years ago

There are cases where environment values coming in may contain non-CGI escaped invalid UTF-8 characters. We have experienced this with requests coming in from Windows Internet Explorer 11 with query strings that contain smart quotes ("\x93\x94") with our Rails server. This would invariably cause stack traces further down the Rack stack because we had invalid strings.

The fix here removes invalid UTF-8 characters before passing it to cleaned_uri_string.

sbleon commented 9 years ago

Thanks, @pkang! Your comment explained the situation very clearly :-)

This is a pragmatic solution, but I want to make sure we're doing the most correct thing with those damn smart quotes. It seems to me like \x-encoded characters in a URL string should be treated literally. The only valid encodings are %-encoding and + signs, IIRC. Nothing else should get unescaped. So maybe we should double the backslash to prevent Ruby from trying to decode things like \x93. What do you think?

pkang commented 9 years ago

Hi @sbleon,

Based on what your idea, what if instead of double backslashing, we iterated through the given string and URI escaped invalid UTF-8 characters before passing it over to cleaned_uri_string?

As a proof of concept, something along the lines of:

"\x80fdsafdsa\x7f漢".chars.map do |c| 
  (c.bytes.length == 1 && c.bytes.first >= 128) ? CGI.escape(c) : c 
end.join

Which would give you "%80fdsafdsa\u007F漢". cleaned_uri_string would then strip out %80 from the string.

Would you like us to file a separate pull request to review?

sbleon commented 9 years ago

The goal of utf8-cleaner is to modify invalid encoded characters so that Ruby/Rack/Rails don't choke on them. We do this for %-encoded characters already. If there are other types of encoded characters that are causing problems, like the smart quotes you're seeing, we need to decide on the appropriate strategy for that type of encoding.

\x encoding is not valid in URIs, so those strings should never get decoded. Our goal should be to escape the backslashes in a minimally-intrusive way so that neither our code nor Ruby/Rack/Rails chokes on them.

Any code needed to handle this should go in URIString. The Middleware is a simple wrapper that passes stuff from the env to the URIString class for cleaning, so it shouldn't do any deep string inspection or processing.

To figure out the next steps, can you tell me whether it's our code or Ruby or Rack or Rails choking on the smart-quotes?

On Wed, Dec 17, 2014 at 3:52 AM, Pablo Kang notifications@github.com wrote:

Hi @sbleon https://github.com/sbleon,

Based on what your idea, what if instead of double backslashing, we iterated through the given string and URI escaped invalid UTF-8 characters before passing it over to cleaned_uri_string?

As a proof of concept, something along the lines of:

"\x80fdsafdsa\x7f漢".chars.map do |c| (c.bytes.length == 1 && c.bytes.first >= 128) ? CGI.escape(c) : c end.join

Which would give you "%80fdsafdsa\u007F漢". cleaned_uri_string would then strip out %80 from the string.

Would you like us to file a separate pull request to review?

— Reply to this email directly or view it on GitHub https://github.com/singlebrook/utf8-cleaner/pull/21#issuecomment-67293098 .

pkang commented 9 years ago

So we understand that it's surprising that these characters are coming in not %-encoded. Yes, in theory, this should not be happening; in practice, it is. That threw us for a loop too, since we thought utf8-cleaner was supposed to get rid of all of our ArgumentError: invalid byte sequence in UTF-8 errors.

We had considered that perhaps our changes were outside of the scope of the utf8-cleaner gem, but then reconsidered since it is called "utf8-cleaner" and thought that maybe you'd want to see if you'd like to handle this particular, real-world scenario.

Our Honeybadger traces, as well as manually investigation, show that our Rack::Cache middleware chokes immediately after UTF8Cleaner::Middleware, which pretty much sits on top of the middleware stack, other than Thin (in development) / Unicorn (in production) , New Relic, Honeybadger.

If you believe this is outside of the scope of your cleaner, feel free to close the pull request and we'll just manually patch our issue.

sbleon commented 9 years ago

I think it's squarely in the scope of the gem. I just want to be really careful about how we approach it, taking the most conservative approach possible. I also want to make sure that the various classes are keeping to their own single responsibilities.

I'd like to be able to reproduce the error to help determine the most appropriate solution. Can you come up with a curl command that causes the crash you're seeing in your app, then post the command and the error backtrace?

On Wed, Dec 17, 2014 at 10:38 AM, Pablo Kang notifications@github.com wrote:

So we understand that it's surprising that these characters are coming in not %-encoded. Yes, in theory, this should not be happening; in practice, it is. That threw us for a loop too, since we thought utf8-cleaner was supposed to get rid of all of our ArgumentError: invalid byte sequence in UTF-8 errors.

We had considered that perhaps our changes were outside of the scope of the utf8-cleaner gem, but then reconsidered since it is called "utf8-cleaner" and thought that maybe you'd want to see if you'd like to handle this particular, real-world scenario.

Our Honeybadger traces, as well as manually investigation, show that our Rack::Cache middleware chokes immediately after UTF8Cleaner::Middleware, which pretty much sits on top of the middleware stack, other than Thin (in development) / Unicorn (in production) , New Relic, Honeybadger.

If you believe this is outside of the scope of your cleaner, feel free to close the pull request and we'll just manually patch our issue.

Reply to this email directly or view it on GitHub https://github.com/singlebrook/utf8-cleaner/pull/21#issuecomment-67339803 .

sbleon commented 8 years ago

Using tidy_bytes on all input solves the smart-quotes problem, so I think this PR is no longer necessary.

singlebrook / utf8-cleaner

Remove invalid UTF-8 characters #21