Closed sbleon closed 8 years ago
Ditto here. Some invalid UTF-8 inside rack.request.form_vars
still triggers an exception with an application using UTF-8 cleaner.
+1
Hey, I have an interest in this issue and want to see if I can contribute.
In my Rails application, which is being crawled by a Chinese spider that POSTs invalid strings to it, invalid UTF-8 isn't being removed from rack.input
in the environment. This seems to be because the string is not URI encoded, even though the content-type of the request is application/x-www-form-urlencoded. This is problematic because it skips the whole section of utf-8 cleaner that removes invalid characters (in #cleaned_uri_string).
I can reproduce this by running my app using utf-8 cleaner locally, and watching the server throw an exception as I CURL it with some invalid UTF-8 POST data that's not URI-encoded, like in this Ruby script:
invalid = "data\xed\xe5\xed\xe0".force_encoding('ASCII-8BIT')
`curl localhost:3004 -d #{invalid}`
I made a commit which removes invalid characters from strings that are not URI-encoded. It works with my applications. I wanted to know if there are problems with doing this that I'm not aware of, or something key that I'm not understanding.
In my comment above I mentioned seeing the exception-raising UTF-8 in rack.request.form_vars
, but after looking at the Rack source code I realized that this value holds the same value as rack.input
and stays unparsed.
Let me know if you have any thoughts, or if there is a way I can help with a pull request at some point.
I'm also seeing issues from a Chinese spider (EasouSpider, right?).
Saw a Stack Overflow question (not by me) for the same problem: http://stackoverflow.com/questions/24648206/ruby-on-rails-invalid-byte-sequence-in-utf-8-due-to-bot
utf8-cleaner doesn't currently help there. But I'm giving @reprah's solution a go. Thank you!
Deployed @reprah's solution last night. So far I believe we've had less errors than before, but we have gotten one "ArgumentError: invalid byte sequence in UTF-8" from EasouSpider after deploying it. I don't see anything obviously odd in the params that our error logger (Honeybadger) shows us. I'll see if I can get better logs somehow.
To attack this problem from both sides, I just mailed service@staff.easou.com:
Hello,
Are you aware that your EasouSpider is causing errors in a lot of Ruby on Rails apps?
See e.g. http://stackoverflow.com/questions/24648206/ruby-on-rails-invalid-byte-sequence-in-utf-8-due-to-bot and https://github.com/ singlebrook/utf8-cleaner/issues/18
Maybe it's something you want to look into.
Looking at the "invalid byte sequence" error we had, it seems to involve this line of Rails code which suggests rack.request.form_hash/rack.request.form_vars is involved. But those seem to be derived from rack.input, which cleaner cleans, so I'm not sure where it goes wrong.
More backtrace in case it helps: https://gist.github.com/henrik/409e989abc62870f5b1b
I've also seen the error @sbleon mentioned but as "ArgumentError: invalid %-encoding".
Full error with backtrace: https://gist.github.com/henrik/0b14a39aa86b1e49c8a4
More reading: https://github.com/rack/rack/issues/337
Supposedly solved elsewhere: https://github.com/whitequark/rack-utf8_sanitizer/pull/15 Will try.
@henrik we've solved all the issues we know of.. in rack-utf8_sanitizer.. :)
but the rack people say the invalid %-encoding should be handled at the web server level
I'm trying out rack-utf8_sanitizer on some of my sites. If it works well, I'll probably recommend that users switch to it.
On Thu, Jul 10, 2014 at 11:02 AM, Benjamin Fleischer < notifications@github.com> wrote:
@henrik https://github.com/henrik we've solved all the issues we know of.. in rack-utf8_sanitizer.. :)
but the rack people say the invalid %-encoding should be handled at the web server level https://github.com/rack/rack/issues/337#issuecomment-46453404
Reply to this email directly or view it on GitHub https://github.com/singlebrook/utf8-cleaner/issues/18#issuecomment-48616637 .
@henrik thanks for helping everyone investigate this. I'm watching the discussions/repos you linked to... it'll be interesting to see how this whole thing plays out.
I'm still getting invalid UTF8 errors from EasouSpider after switching to rack-utf8-sanitizer.
Because this is all from one apparently-misbehaved agent, I'm considering returning some kind of 4xx error for this condition instead of trying to "clean" the data. Thoughts on that?
On Jul 10, 2014, at 7:44 PM, reprah notifications@github.com wrote:
@henrik thanks for helping everyone investigate this. I'm watching the discussions/repos you linked to... it'll be interesting to see how this whole thing plays out.
— Reply to this email directly or view it on GitHub.
@sbleon Yeah, I did too, so I'm using rack-utf8_sanitizer plus some middleware by @bf4 that's supposed to fix the rest. Added the middleware just a few hours ago so it's soon to say if it's fixed the issue.
@henrik fwiw, in the actual version of the middleware, I use ActionDispatch::Request, request.formats.first, and ActionDispatch::Response.default_charset
@bf4 Thanks. Would request.formats.first
be instead of DEFAULT_CONTENT_TYPE
?
I'm running it as
config.middleware.insert_before "Rack::Runtime", 'ExceptionApp'
config.middleware.insert_before "Rack::Runtime", Rack::UTF8Sanitizer
(haven't changed the 'Rack::Runtime' to 0 yet)
I added the 'Rails' version of the middleware in a comment https://gist.github.com/bf4/d26259acfa29f3b9882b#comment-1262675 I don't think it should make a difference, though.
@sbleon, will @reprah's fix be incorporated into utf8-cleaner? If not, what solution do you recommend?
@nickhoffman I'd recommend doing this: http://stackoverflow.com/a/24727310/6962
It's just a writeup of what I said above. Haven't seen a single error for over 24 hours.
Thanks, @henrik. I'm still curious to know if utf8-cleaner will be updated to deal with this type of HTTP request.
It's unlikely to happen soon, as I'm swamped, but if someone wants to construct a PR, I'm happy to review it.
On Tue, Jul 15, 2014 at 9:48 AM, Nick Hoffman notifications@github.com wrote:
Thanks, @henrik https://github.com/henrik. I'm still curious to know if utf8-cleaner will be updated to deal with this type of HTTP request.
Reply to this email directly or view it on GitHub https://github.com/singlebrook/utf8-cleaner/issues/18#issuecomment-49033594 .
It's not a utf8 issue. It's Ruby's URI module
@bf4 your solution worked well for me. Thanks!
@henri, thanks for linking to http://stackoverflow.com/a/24727310/6962 . Since that solution involves inserting middleware before rack-utf8_sanitizer, why do you recommend rack-utf8_sanitizer instead of utf8-cleaner?
@nickhoffman I've never tried @bf4's middleware together with utf8-cleaner – I just happened to try switching to rack-utf8_sanitizer, then added on the middleware, and then the problem was solved. If utf8-cleaner + middleware works well, that's great to hear. Add a comment to that effect on http://stackoverflow.com/a/24727310/6962 if you like.
Really, just look at the implementation and see which one you like better
@henrik Thanks for that writeup in stackoverflow. It worked for me!
I think our use of tidy_bytes
solves this issue.
I think this might be the culprit: