singlebrook / utf8-cleaner

MIT License
277 stars 44 forks source link

Not encoded url string should be invalid. #7

Closed lulalala closed 8 years ago

lulalala commented 10 years ago

Related to #6, some request does not encode url string at all, but just pass in Unicode characters as is. These should be invalid.

lulalala commented 10 years ago

Just realized your gem not only checks url strings but also cookies, so this PR will break that.

murphyslaw commented 10 years ago

:+1: Would like to see this fixed.

sbleon commented 10 years ago

What is happening when the non-encoded characters are received by the server? Is Ruby choking on them?

What do the RFCs say about how non-encoded characters should be handled?

On Wed, Feb 19, 2014 at 8:41 AM, Falk Köppe notifications@github.comwrote:

[image: :+1:] Would like to see this fixed.

Reply to this email directly or view it on GitHubhttps://github.com/singlebrook/utf8-cleaner/pull/7#issuecomment-35494544 .

murphyslaw commented 10 years ago

It should be RFC 3986. http://tools.ietf.org/html/rfc3986 Have to read the important passages a few more times, before I would dare an answer.

My first guess is that the encoding of non-encoded characters should be guessed by the URI handling application. So it is fine to assume UTF-8 in our case. Usually it should comply to UTF-8 but there probably is a chance to guess a different URI encoding by trying other popular encodings if it is not UTF-8. But that seems to be out of scope of the gem.

Google seems to handle the "same" URI encoded differently to the non-encoded URI. https://productforums.google.com/forum/#!topic/webmasters/n6lKwQjiNHs

Therefore I would say it is good to interpret non-encoded URL as different and invalid. I also feel like this is what users of the gem expect. But please double check this, I'm not an expert on this topic.

lulalala commented 10 years ago

@sbleon for me, Ruby will raise error at this line:

    # invalid byte sequence in UTF-8
    # ruby-1.9.3-p448/lib/ruby/1.9.1/cgi/util.rb:16 in tr
    str=string.tr('+', ' ').force_encoding(Encoding::ASCII_8BIT).gsub(/((?:%[0-9a-fA-F]{2})+)/) do

Though it was due to invalid encoding, I thought forcing ASCII would fix it anyways, hence this PR.

However the gem also does the checking on cookies and referral uri, which I think my PR should not apply to them. However I have no come up with an elegant way to do this. So in my fork I just ignored cookies completely.

oggy commented 10 years ago

Perhaps a lighter touch would be to just replace invalid utf chars in the original string before checking for the '%' and URI.decode-ing it. Would such a patch be accepted?

lulalala commented 10 years ago

@oggy No, because these are valid utf-8 characters. It's just that UTF-8 characters are not valid in a uri. You can find all non-ascii characters and encode them in % form, and then do the checking.

oggy commented 10 years ago

@lulalala Thanks, I misread the original description.

I do think my proposal is valid, but solves a different issue. Will work on a PR.

sbleon commented 10 years ago

@lulalala the build failed in Ruby 1.9.3. You might need to add the file encoding comment at the top of any files that contains UTF8 chars.

lulalala commented 10 years ago

@sbleon Thanks, you are right. I have added it now.

Oh as a reminder, even though all tests passed, this patch will strip out unicode characters in cookies (which may not be what you want because Chrome stores unicode characters as utf-8).

sbleon commented 10 years ago

Thanks, @lulalala.

I re-read the URI RFC and while I agree that UTF-8 characters are invalid in the URL, I think that removing them is beyond the scope of this gem. This gem really aims to prevent the application from crashing due to malformed UTF-8.

Your specs don't show Ruby or Rails crashing when you pass it valid UTF-8, and I haven't been able to get mine to crash. Can you create a spec (in middleware_spec.rb that shows the crash you're talking about)?

PS. I did do a little bit of research into the best regex, and if we do need to test for "only valid characters" I would use

%r(^[-\w.~:/?#\[\]@!$&'\(\)*+,;=]*$) =~ string
lulalala commented 10 years ago

Hi @sbleon My first commit contains both the spec and the fix. So if you comment out this line: https://github.com/lulalala/utf8-cleaner/blob/92006f2cefb670d866f891c881536315253ffee1/lib/utf8-cleaner/uri_string.rb#L74 And then run rspec, it will fail. (P.S. This example was taken from my log, it is gibberish and I am not sure if it is valid UTF-8 characters.)

What I did in my fork is to raise 400 error instead of sanitizing the params: https://github.com/lulalala/utf8-cleaner/commit/682f4ad11f77838e85a15c23845b112666206fde I totally understand that you think this is outside the scope. So don't worry if you close this. It's just that I saw issue #6 and think this can fix that.

sbleon commented 10 years ago

Hm. I see now that's it's hard to write a spec that shows Rails or Rack crashing when we're not actually using Rails when running the specs!

@lulalala can you maybe craft a curl command that crashes your Rails app and paste it here, along with the backtrace?

sbleon commented 8 years ago

Closing due to inactivity and out-of-scope-ness.