singlebrook / utf8-cleaner

Is UTF8-cleaner the right solution for this issue? #5

Closed helloluis closed 10 years ago

helloluis commented 10 years ago

We've got a lot of user input that occasionally turns out to be in the wrong encoding (I'm not sure how that happens, and I'm not 100% sure why MongoDB doesn't force it into UTF-8). Recently I've taken to using a forced encode like the one below on strings that need to be displayed, but as you can imagine this is untenable for anything except the most limited scenarios.

str.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

Is this the kind of situation that UTF8-cleaner was built to fix? Or does it only work for incoming strings?

sbleon commented 10 years ago

The utf8-cleaner middleware operates only on incoming strings, and currently only handles URI-encoded strings. The gem could be enhanced to clean up non-URI-encoded strings, and the utility classes could also be used outside of the middleware (e.g. for displaying questionable data). However, in its current state it doesn't address your particular issue.
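
For cleaning strings at display time without the gem, something along these lines might work. This is only a sketch, not part of utf8-cleaner: `displayable_utf8` is a hypothetical helper name, it assumes the stored bytes are mostly UTF-8 with the occasional stray invalid byte, and it relies on `String#scrub`, which needs Ruby 2.1 or later. The forced encode from the question is included for comparison, since converting from 'binary' drops every non-ASCII byte, valid or not.

```ruby
# Hypothetical helper (not part of utf8-cleaner): treat the bytes as UTF-8
# and replace only the sequences that are not valid UTF-8.
def displayable_utf8(str, replacement = '')
  utf8 = str.dup.force_encoding(Encoding::UTF_8)  # relabel the bytes, no conversion
  return utf8 if utf8.valid_encoding?             # already clean, nothing to do
  utf8.scrub(replacement)                         # swap invalid sequences for the replacement
end

# Sample input: a valid two-byte "é" plus one stray invalid byte (0xFF).
dirty = "caf\xC3\xA9 \xFF ok".b

clean = displayable_utf8(dirty)
# clean keeps the "é" and loses only the 0xFF byte.

# The forced encode from the question: every byte >= 0x80 is undefined when
# converting from binary, so the valid "é" is removed along with the bad byte.
stripped = dirty.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
```

The difference matters if the input legitimately contains accented or non-Latin characters: the helper preserves them, while the binary-to-UTF-8 encode leaves ASCII only.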
