Closed lulalala closed 8 years ago
Just realized your gem not only checks url strings but also cookies, so this PR will break that.
:+1: Would like to see this fixed.
What is happening when the non-encoded characters are received by the server? Is Ruby choking on them?
What do the RFCs say about how non-encoded characters should be handled?
On Wed, Feb 19, 2014 at 8:41 AM, Falk Köppe notifications@github.comwrote:
[image: :+1:] Would like to see this fixed.
Reply to this email directly or view it on GitHubhttps://github.com/singlebrook/utf8-cleaner/pull/7#issuecomment-35494544 .
It should be RFC 3986. http://tools.ietf.org/html/rfc3986 Have to read the important passages a few more times, before I would dare an answer.
My first guess is that the encoding of non-encoded characters should be guessed by the URI handling application. So it is fine to assume UTF-8 in our case. Usually it should comply to UTF-8 but there probably is a chance to guess a different URI encoding by trying other popular encodings if it is not UTF-8. But that seems to be out of scope of the gem.
Google seems to handle the "same" URI encoded differently to the non-encoded URI. https://productforums.google.com/forum/#!topic/webmasters/n6lKwQjiNHs
Therefore I would say it is good to interpret non-encoded URL as different and invalid. I also feel like this is what users of the gem expect. But please double check this, I'm not an expert on this topic.
@sbleon for me, Ruby will raise error at this line:
# invalid byte sequence in UTF-8
# ruby-1.9.3-p448/lib/ruby/1.9.1/cgi/util.rb:16 in tr
str=string.tr('+', ' ').force_encoding(Encoding::ASCII_8BIT).gsub(/((?:%[0-9a-fA-F]{2})+)/) do
Though it was due to invalid encoding, I thought forcing ASCII would fix it anyways, hence this PR.
However the gem also does the checking on cookies and referral uri, which I think my PR should not apply to them. However I have no come up with an elegant way to do this. So in my fork I just ignored cookies completely.
Perhaps a lighter touch would be to just replace invalid utf chars in the original string before checking for the '%' and URI.decode-ing it. Would such a patch be accepted?
@oggy No, because these are valid utf-8 characters. It's just that UTF-8 characters are not valid in a uri. You can find all non-ascii characters and encode them in % form, and then do the checking.
@lulalala Thanks, I misread the original description.
I do think my proposal is valid, but solves a different issue. Will work on a PR.
@lulalala the build failed in Ruby 1.9.3. You might need to add the file encoding comment at the top of any files that contains UTF8 chars.
@sbleon Thanks, you are right. I have added it now.
Oh as a reminder, even though all tests passed, this patch will strip out unicode characters in cookies (which may not be what you want because Chrome stores unicode characters as utf-8).
Thanks, @lulalala.
I re-read the URI RFC and while I agree that UTF-8 characters are invalid in the URL, I think that removing them is beyond the scope of this gem. This gem really aims to prevent the application from crashing due to malformed UTF-8.
Your specs don't show Ruby or Rails crashing when you pass it valid UTF-8, and I haven't been able to get mine to crash. Can you create a spec (in middleware_spec.rb
that shows the crash you're talking about)?
PS. I did do a little bit of research into the best regex, and if we do need to test for "only valid characters" I would use
%r(^[-\w.~:/?#\[\]@!$&'\(\)*+,;=]*$) =~ string
Hi @sbleon My first commit contains both the spec and the fix. So if you comment out this line: https://github.com/lulalala/utf8-cleaner/blob/92006f2cefb670d866f891c881536315253ffee1/lib/utf8-cleaner/uri_string.rb#L74 And then run rspec, it will fail. (P.S. This example was taken from my log, it is gibberish and I am not sure if it is valid UTF-8 characters.)
What I did in my fork is to raise 400 error instead of sanitizing the params: https://github.com/lulalala/utf8-cleaner/commit/682f4ad11f77838e85a15c23845b112666206fde I totally understand that you think this is outside the scope. So don't worry if you close this. It's just that I saw issue #6 and think this can fix that.
Hm. I see now that's it's hard to write a spec that shows Rails or Rack crashing when we're not actually using Rails when running the specs!
@lulalala can you maybe craft a curl
command that crashes your Rails app and paste it here, along with the backtrace?
Closing due to inactivity and out-of-scope-ness.
Related to #6, some request does not encode url string at all, but just pass in Unicode characters as is. These should be invalid.