singlebrook / utf8-cleaner

MIT License
277 stars 44 forks source link

Some form inputs still result in "invalid byte sequence" #18

Closed sbleon closed 8 years ago

sbleon commented 10 years ago

I think this might be the culprit:

rack.request.form_vars: aliCert Validation Network1 0   U
    ValiCert, Inc.1503  U   ,ValiCert Class 2 Policy Validation Authority1!0   U    http://www.valicert.com/1 0         *†H†÷
           info@valicert.com‚
reprah commented 10 years ago

Ditto here. Some invalid UTF-8 inside rack.request.form_vars still triggers an exception with an application using UTF-8 cleaner.

nikue commented 10 years ago

+1

reprah commented 10 years ago

Hey, I have an interest in this issue and want to see if I can contribute.

In my Rails application, which is being crawled by a Chinese spider that POSTs invalid strings to it, invalid UTF-8 isn't being removed from rack.input in the environment. This seems to be because the string is not URI encoded, even though the content-type of the request is application/x-www-form-urlencoded. This is problematic because it skips the whole section of utf-8 cleaner that removes invalid characters (in #cleaned_uri_string).

I can reproduce this by running my app using utf-8 cleaner locally, and watching the server throw an exception as I CURL it with some invalid UTF-8 POST data that's not URI-encoded, like in this Ruby script:

invalid = "data\xed\xe5\xed\xe0".force_encoding('ASCII-8BIT')
`curl localhost:3004 -d #{invalid}`

I made a commit which removes invalid characters from strings that are not URI-encoded. It works with my applications. I wanted to know if there are problems with doing this that I'm not aware of, or something key that I'm not understanding.

In my comment above I mentioned seeing the exception-raising UTF-8 in rack.request.form_vars, but after looking at the Rack source code I realized that this value holds the same value as rack.input and stays unparsed.

Let me know if you have any thoughts, or if there is a way I can help with a pull request at some point.

henrik commented 10 years ago

I'm also seeing issues from a Chinese spider (EasouSpider, right?).

Saw a Stack Overflow question (not by me) for the same problem: http://stackoverflow.com/questions/24648206/ruby-on-rails-invalid-byte-sequence-in-utf-8-due-to-bot

utf8-cleaner doesn't currently help there. But I'm giving @reprah's solution a go. Thank you!

henrik commented 10 years ago

Deployed @reprah's solution last night. So far I believe we've had less errors than before, but we have gotten one "ArgumentError: invalid byte sequence in UTF-8" from EasouSpider after deploying it. I don't see anything obviously odd in the params that our error logger (Honeybadger) shows us. I'll see if I can get better logs somehow.

henrik commented 10 years ago

To attack this problem from both sides, I just mailed service@staff.easou.com:

Hello,

Are you aware that your EasouSpider is causing errors in a lot of Ruby on Rails apps?

See e.g. http://stackoverflow.com/questions/24648206/ruby-on-rails-invalid-byte-sequence-in-utf-8-due-to-bot and https://github.com/ singlebrook/utf8-cleaner/issues/18

Maybe it's something you want to look into.

henrik commented 10 years ago

Looking at the "invalid byte sequence" error we had, it seems to involve this line of Rails code which suggests rack.request.form_hash/rack.request.form_vars is involved. But those seem to be derived from rack.input, which cleaner cleans, so I'm not sure where it goes wrong.

More backtrace in case it helps: https://gist.github.com/henrik/409e989abc62870f5b1b

henrik commented 10 years ago

I've also seen the error @sbleon mentioned but as "ArgumentError: invalid %-encoding".

Full error with backtrace: https://gist.github.com/henrik/0b14a39aa86b1e49c8a4

henrik commented 10 years ago

More reading: https://github.com/rack/rack/issues/337

Supposedly solved elsewhere: https://github.com/whitequark/rack-utf8_sanitizer/pull/15 Will try.

bf4 commented 10 years ago

@henrik we've solved all the issues we know of.. in rack-utf8_sanitizer.. :)

but the rack people say the invalid %-encoding should be handled at the web server level

sbleon commented 10 years ago

I'm trying out rack-utf8_sanitizer on some of my sites. If it works well, I'll probably recommend that users switch to it.

On Thu, Jul 10, 2014 at 11:02 AM, Benjamin Fleischer < notifications@github.com> wrote:

@henrik https://github.com/henrik we've solved all the issues we know of.. in rack-utf8_sanitizer.. :)

but the rack people say the invalid %-encoding should be handled at the web server level https://github.com/rack/rack/issues/337#issuecomment-46453404

Reply to this email directly or view it on GitHub https://github.com/singlebrook/utf8-cleaner/issues/18#issuecomment-48616637 .

reprah commented 10 years ago

@henrik thanks for helping everyone investigate this. I'm watching the discussions/repos you linked to... it'll be interesting to see how this whole thing plays out.

sbleon commented 10 years ago

I'm still getting invalid UTF8 errors from EasouSpider after switching to rack-utf8-sanitizer.

Because this is all from one apparently-misbehaved agent, I'm considering returning some kind of 4xx error for this condition instead of trying to "clean" the data. Thoughts on that?

On Jul 10, 2014, at 7:44 PM, reprah notifications@github.com wrote:

@henrik thanks for helping everyone investigate this. I'm watching the discussions/repos you linked to... it'll be interesting to see how this whole thing plays out.

— Reply to this email directly or view it on GitHub.

henrik commented 10 years ago

@sbleon Yeah, I did too, so I'm using rack-utf8_sanitizer plus some middleware by @bf4 that's supposed to fix the rest. Added the middleware just a few hours ago so it's soon to say if it's fixed the issue.

bf4 commented 10 years ago

@henrik fwiw, in the actual version of the middleware, I use ActionDispatch::Request, request.formats.first, and ActionDispatch::Response.default_charset

henrik commented 10 years ago

@bf4 Thanks. Would request.formats.first be instead of DEFAULT_CONTENT_TYPE?

bf4 commented 10 years ago

I'm running it as

  config.middleware.insert_before "Rack::Runtime", 'ExceptionApp'
  config.middleware.insert_before "Rack::Runtime", Rack::UTF8Sanitizer

(haven't changed the 'Rack::Runtime' to 0 yet)

I added the 'Rails' version of the middleware in a comment https://gist.github.com/bf4/d26259acfa29f3b9882b#comment-1262675 I don't think it should make a difference, though.

nickhoffman commented 10 years ago

@sbleon, will @reprah's fix be incorporated into utf8-cleaner? If not, what solution do you recommend?

henrik commented 10 years ago

@nickhoffman I'd recommend doing this: http://stackoverflow.com/a/24727310/6962

It's just a writeup of what I said above. Haven't seen a single error for over 24 hours.

nickhoffman commented 10 years ago

Thanks, @henrik. I'm still curious to know if utf8-cleaner will be updated to deal with this type of HTTP request.

sbleon commented 10 years ago

It's unlikely to happen soon, as I'm swamped, but if someone wants to construct a PR, I'm happy to review it.

On Tue, Jul 15, 2014 at 9:48 AM, Nick Hoffman notifications@github.com wrote:

Thanks, @henrik https://github.com/henrik. I'm still curious to know if utf8-cleaner will be updated to deal with this type of HTTP request.

Reply to this email directly or view it on GitHub https://github.com/singlebrook/utf8-cleaner/issues/18#issuecomment-49033594 .

bf4 commented 10 years ago

It's not a utf8 issue. It's Ruby's URI module

sbleon commented 10 years ago

@bf4 your solution worked well for me. Thanks!

nickhoffman commented 10 years ago

@henri, thanks for linking to http://stackoverflow.com/a/24727310/6962 . Since that solution involves inserting middleware before rack-utf8_sanitizer, why do you recommend rack-utf8_sanitizer instead of utf8-cleaner?

henrik commented 10 years ago

@nickhoffman I've never tried @bf4's middleware together with utf8-cleaner – I just happened to try switching to rack-utf8_sanitizer, then added on the middleware, and then the problem was solved. If utf8-cleaner + middleware works well, that's great to hear. Add a comment to that effect on http://stackoverflow.com/a/24727310/6962 if you like.

bf4 commented 10 years ago

Really, just look at the implementation and see which one you like better

paologlim commented 10 years ago

@henrik Thanks for that writeup in stackoverflow. It worked for me!

sbleon commented 8 years ago

I think our use of tidy_bytes solves this issue.