rspeer / python-ftfy

Fixes mojibake and other glitches in Unicode text, after the fact.
http://ftfy.readthedocs.org

Unfixed mojibake: â?? #83

Closed: john-parton closed this issue 6 years ago

john-parton commented 6 years ago

I realize this is a tiny issue, but I found a few places where a single quote or apostrophe ' was getting mangled as â??. I'm not sure if there's a general fix for this, or if it's just a corner case that I need to be aware of.

Real page where that happens: http://www.metrolyrics.com/oh-girl-lyrics-various.html

The text on the page: They say, â??Let the woman take care of youâ??

Here's another example:

>>> import ftfy
>>> ftfy.fix_text("See loverâ??s hand in hand walk by")
'See loverâ??s hand in hand walk by'

Here's where I got the previous snippet: http://www.metrolyrics.com/when-will-the-good-apples-fall-lyrics-seekers.html

Let me know if you need more examples. Thanks.

rspeer commented 6 years ago

The problem here is that the question marks represent missing data.

Presumably the left and right curly apostrophes were being mis-decoded as 'â\x80\x98' and 'â\x80\x99' -- that is, the two characters after the â are control characters -- and then the control characters were replaced by question marks.
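The two steps described above can be reproduced with a minimal sketch. It assumes the wrong decoding was Latin-1 (the codec that turns bytes 0x80-0x9F into control characters) and that some later stage swapped those control characters for `?`; the function name `simulate_lossy_mojibake` is hypothetical and is not part of ftfy.

```python
def simulate_lossy_mojibake(text: str) -> str:
    """Hypothetical helper: mimic the damage described in this thread."""
    # Step 1: the UTF-8 bytes are wrongly decoded as Latin-1 (assumed codec).
    misdecoded = text.encode("utf-8").decode("latin-1")
    # Step 2: C1 control characters (U+0080..U+009F) are replaced by '?'.
    return "".join("?" if "\x80" <= ch <= "\x9f" else ch for ch in misdecoded)


left = simulate_lossy_mojibake("\u2018")   # left curly quote
right = simulate_lossy_mojibake("\u2019")  # right curly quote

print(repr(left))     # 'â??'
print(left == right)  # True: both quotes collapse to the same three characters
```

Once the second step has run, the original bytes 0x98 and 0x99 are gone, which is why the two quotes can no longer be told apart.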

But in this case alone there are 1024 different characters that could have become â??, so there's nothing that ftfy can do to recover the text.
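The figure of 1024 can be checked by enumeration, under the same assumption that each `?` stands for one of the 32 Latin-1 control characters (bytes 0x80 through 0x9F). The sketch below builds every three-byte UTF-8 sequence that starts with 0xE2 (the byte that shows up as â) and decodes it; the variable names are illustrative only.

```python
# Every byte that Latin-1 would decode to a C1 control character.
CONTROL_BYTES = range(0x80, 0xA0)  # 32 values

candidates = []
for second in CONTROL_BYTES:
    for third in CONTROL_BYTES:
        # 0xE2 is the lead byte that appears as 'â' after mis-decoding.
        candidates.append(bytes([0xE2, second, third]).decode("utf-8"))

print(len(candidates))          # 1024
print("\u2018" in candidates)   # True: left curly single quote
print("\u2019" in candidates)   # True: right curly single quote
print("\u201c" in candidates)   # True: left curly double quote
```

Because curly double quotes and many other punctuation marks are in the same set, the sequence alone does not say which character was originally there.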

john-parton commented 6 years ago

Thanks for addressing my issue so quickly.

I have hundreds of megabytes of text that has that sequence, and as far as I can tell, it's always supposed to be ', so I'll just add in a post-processing step after fix_text.
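A post-processing step of that kind might look like the sketch below. It rests on the assumption stated above, that every â?? in this particular corpus stands for an apostrophe; the name `restore_apostrophes` is hypothetical and not part of ftfy. In practice it would be applied to the output of `ftfy.fix_text`, which the example earlier in the thread shows leaves the sequence untouched.

```python
# 'â' followed by two literal question marks, as seen in the damaged text.
LOSSY_SEQUENCE = "\u00e2??"


def restore_apostrophes(text: str, replacement: str = "'") -> str:
    """Hypothetical helper: replace the lossy sequence with an apostrophe.

    This is a corpus-specific guess, not a real decoding: it is only safe
    when the sequence is known to always stand for the same character.
    """
    return text.replace(LOSSY_SEQUENCE, replacement)


# Intended use (ftfy is not imported here so the sketch stays stdlib-only):
#     cleaned = restore_apostrophes(ftfy.fix_text(raw_text))
fixed = restore_apostrophes("See lover\u00e2??s hand in hand walk by")
print(fixed)  # See lover's hand in hand walk by
```

Passing `replacement="\u2019"` would restore a curly apostrophe instead of the ASCII one.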