Closed vovimayhem closed 9 years ago
Sounds like a good addition! Go for it!
On Mon, Apr 20, 2015 at 12:45 PM, Roberto Quintanilla < notifications@github.com> wrote:
Hi! I'm currently working on a website with a lot of traffic in Latin America.
We've found some agents that publish the "User-Agent" header with invalid characters, making our app crash. We rolled up a fix on our app to re-process this header with the corrected string, before noticing this library.
I'm reviewing, however, that in your middleware class, your'e not sanitizing the User-Agent header / HTTP_USER_AGENT env key https://github.com/singlebrook/utf8-cleaner/blob/ce98e8f00f026b9d9aef95b39f7f75a842b9a6a5/lib/utf8-cleaner/middleware.rb#L4 .
It's that on purpose? I'm preparing a PR in case it's desired.
Reply to this email directly or view it on GitHub https://github.com/singlebrook/utf8-cleaner/issues/23.
I'm having trouble right now with two use cases. In both, we're noticing the '�' character in the string, before our app crashes when doing something useful with the string:
Use Case 1: Whenever I get this User-Agent string from god-knows-which cellphone model:
require 'json'
# This is how I've managed to replicate the User-Agent string this agent is sending to our servers:
problem_user_agent = "Mozilla/5.0 (Linux; U; Android 2.3.6; es-co; XT320 Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Versi\xF3n/4.0 Mobile Safari/533.1".force_encoding(Encoding::UTF_8) # Notice the '\xF3' part, should read "Versión"
# Watch this fail... the same happens whenever I try to add this string to our database:
JSON.generate(user_agent: problem_user_agent)
Use Case 2: Whenever the agent sends unescaped URL query strings:
require "json"
# This is how I've managed to replicate the query string this agent is sending to our servers:
problem_url = "http://www.example.com/search?terms=M\xE9xico".force_encoding(Encoding::UTF_8) # should read "México"
# Again watch this fail...
JSON.generate(url: problem_url)
The best way I came up with to process these strings AND preserve the data that was originally meant is by using ActiveSupport core extensions for the String class, like this:
require "json"
require "active_support/core_ext/string/multibyte"
problem_user_agent = "Mozilla/5.0 (Linux; U; Android 2.3.6; es-co; XT320 Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Versi\xF3n/4.0 Mobile Safari/533.1".force_encoding(Encoding::UTF_8)
processed_user_agent = problem_user_agent.mb_chars.tidy_bytes.to_s
# Now watch it work:
JSON.generate(user_agent: processed_user_agent)
It makes sense for me to do this before the processing utf8-cleaner does. But there are 2 issues if this functionality is to be included in this gem:
I don't know at this point if it's useful for users of this gem to include this (if so, I can PR), but in the meantime I'll just create a new Rack Middleware class to process the data and then let the utf8-cleaner to do it's stuff.
Gist with the Rack Middleware I generated: https://gist.github.com/vovimayhem/f7bc4b8f06fefa798374
I didn't know about tidy_bytes
. That is awesome!
It looks like you're tackling a couple of different problems:
I'd like to see both of these incorporated into this gem, with suitable test cases, and probably in separate PRs.
@vovimayhem I believe this addressed in the new version v0.1.1. Thanks for the suggestion about tidy_bytes
!
Great!, I go to update because I have the same problem. Thanks!
Hi! I'm currently working on a website with a lot of traffic in Latin America.
We've found some agents that publish the "User-Agent" header with invalid characters, making our app crash. We rolled up a fix on our app to re-process this header with the corrected string, before noticing this library.
I'm reviewing, however, that in your middleware class, your'e not sanitizing the User-Agent header / HTTP_USER_AGENT env key.
It's that on purpose? I'm preparing a PR in case it's desired.