Invalid byte sequence in User-Agent request header

vovimayhem commented 9 years ago

Hi! I'm currently working on a website with a lot of traffic in Latin America.

We've found some agents that publish the "User-Agent" header with invalid characters, making our app crash. We rolled up a fix on our app to re-process this header with the corrected string, before noticing this library.

I'm reviewing, however, that in your middleware class, your'e not sanitizing the User-Agent header / HTTP_USER_AGENT env key.

It's that on purpose? I'm preparing a PR in case it's desired.

sbleon commented 9 years ago

Sounds like a good addition! Go for it!

On Mon, Apr 20, 2015 at 12:45 PM, Roberto Quintanilla < notifications@github.com> wrote:

Hi! I'm currently working on a website with a lot of traffic in Latin America.

We've found some agents that publish the "User-Agent" header with invalid characters, making our app crash. We rolled up a fix on our app to re-process this header with the corrected string, before noticing this library.

I'm reviewing, however, that in your middleware class, your'e not sanitizing the User-Agent header / HTTP_USER_AGENT env key https://github.com/singlebrook/utf8-cleaner/blob/ce98e8f00f026b9d9aef95b39f7f75a842b9a6a5/lib/utf8-cleaner/middleware.rb#L4 .

It's that on purpose? I'm preparing a PR in case it's desired.

Reply to this email directly or view it on GitHub https://github.com/singlebrook/utf8-cleaner/issues/23.

vovimayhem commented 9 years ago

I'm having trouble right now with two use cases. In both, we're noticing the '�' character in the string, before our app crashes when doing something useful with the string:

Use Case 1: Whenever I get this User-Agent string from god-knows-which cellphone model:

require 'json'

# This is how I've managed to replicate the User-Agent string this agent is sending to our servers:
problem_user_agent = "Mozilla/5.0 (Linux; U; Android 2.3.6; es-co; XT320 Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Versi\xF3n/4.0 Mobile Safari/533.1".force_encoding(Encoding::UTF_8) # Notice the '\xF3' part, should read "Versión"

# Watch this fail... the same happens whenever I try to add this string to our database:
JSON.generate(user_agent: problem_user_agent)

Use Case 2: Whenever the agent sends unescaped URL query strings:

require "json"

# This is how I've managed to replicate the query string this agent is sending to our servers:
problem_url = "http://www.example.com/search?terms=M\xE9xico".force_encoding(Encoding::UTF_8) # should read "México"

# Again watch this fail...
JSON.generate(url: problem_url)

The best way I came up with to process these strings AND preserve the data that was originally meant is by using ActiveSupport core extensions for the String class, like this:

require "json"
require "active_support/core_ext/string/multibyte"

problem_user_agent = "Mozilla/5.0 (Linux; U; Android 2.3.6; es-co; XT320 Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Versi\xF3n/4.0 Mobile Safari/533.1".force_encoding(Encoding::UTF_8)

processed_user_agent = problem_user_agent.mb_chars.tidy_bytes.to_s

# Now watch it work:
JSON.generate(user_agent: processed_user_agent)

It makes sense for me to do this before the processing utf8-cleaner does. But there are 2 issues if this functionality is to be included in this gem:

Will require activesupport as a dependency
The performance of the 'tidy_bytes' method implementation in activesupport version < 4.1 is slow compared to >= 4.1, but only if using ruby >= 2.1

I don't know at this point if it's useful for users of this gem to include this (if so, I can PR), but in the meantime I'll just create a new Rack Middleware class to process the data and then let the utf8-cleaner to do it's stuff.

vovimayhem commented 9 years ago

Gist with the Rack Middleware I generated: https://gist.github.com/vovimayhem/f7bc4b8f06fefa798374

sbleon commented 9 years ago

I didn't know about tidy_bytes. That is awesome!

It looks like you're tackling a couple of different problems:

Fixing non-ASCII characters in some common non-UTF8 encodings.
URI-encoding inputs that should have been encoded by the user agent but were not.

I'd like to see both of these incorporated into this gem, with suitable test cases, and probably in separate PRs.

sbleon commented 9 years ago

@vovimayhem I believe this addressed in the new version v0.1.1. Thanks for the suggestion about tidy_bytes!

007lva commented 9 years ago

Great!, I go to update because I have the same problem. Thanks!

singlebrook / utf8-cleaner

Invalid byte sequence in User-Agent request header #23