Closed john-parton closed 6 years ago
The problem here is that the question marks represent missing data.
Presumably the left and right curly apostrophes were being mis-decoded as 'â\x80\x98' and 'â\x80\x99' -- that is, the two characters after the â are control characters -- and then the control characters were replaced by question marks. But in this case alone there are 1024 different characters that could have become â??, so there's nothing that ftfy can do to recover the text.
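To make the reasoning above concrete, here is a minimal standard-library sketch. It assumes the damage happened in two steps: UTF-8 bytes were decoded as Latin-1, and then every C1 control character (U+0080 to U+009F) was replaced with a question mark. The helper names `mojibake` and `lossy` are made up for illustration and are not part of ftfy.

```python
# Sketch of the presumed corruption; helper names are hypothetical, not ftfy API.
LEFT, RIGHT = "\u2018", "\u2019"  # left and right curly single quotes


def mojibake(ch: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return ch.encode("utf-8").decode("latin-1")


def lossy(s: str) -> str:
    """Assumed lossy step: replace C1 control characters with '?'."""
    return "".join("?" if 0x80 <= ord(c) <= 0x9F else c for c in s)


# Every character whose UTF-8 form is byte E2 followed by two bytes
# in the range 0x80-0x9F (the bytes that Latin-1 reads as C1 controls).
candidates = [
    bytes([0xE2, b1, b2]).decode("utf-8")
    for b1 in range(0x80, 0xA0)
    for b2 in range(0x80, 0xA0)
]

print(repr(mojibake(LEFT)))          # 'â\x80\x98'
print(lossy(mojibake(LEFT)))         # â??
print(lossy(mojibake(RIGHT)))        # â??
print(len(candidates))               # 1024
```

Both quotes collapse to the same three characters, and so does every other entry in `candidates`, which is why the original character cannot be recovered from the damaged text alone.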
Thanks for addressing my issue so quickly.
I have hundreds of megabytes of text that has that sequence, and as far as I can tell, it's always supposed to be a plain apostrophe ('), so I'll just add in a post-processing step after fix_text.
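The thread does not show what that post-processing step looked like, so the following is only a sketch of one way to do it. It assumes, as stated above for this particular corpus, that every â?? stands for a plain apostrophe. The function name `repair_quotes` is hypothetical; in practice it would be applied to the output of `ftfy.fix_text`.

```python
# Hypothetical post-processing step; only safe when the corpus-specific
# assumption holds that every "â??" was originally a quote or apostrophe.
DAMAGED = "\u00e2??"  # "â" followed by two literal question marks


def repair_quotes(text: str) -> str:
    """Replace the damaged sequence with a plain ASCII apostrophe."""
    return text.replace(DAMAGED, "'")


# Intended use (requires ftfy to be installed):
#     cleaned = repair_quotes(ftfy.fix_text(raw_text))
sample = "They say, \u00e2??Let the woman take care of you\u00e2??"
print(repair_quotes(sample))  # They say, 'Let the woman take care of you'
```

A plain string replacement cannot tell an opening quote from a closing one, which is why it falls back to the ASCII apostrophe instead of trying to restore the curly quotes.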
I realize this is a tiny issue, but I found a few places where a single quote or apostrophe (') was getting mangled as â??. I'm not sure if there's a general fix for this, or if it's just a corner case that I need to be aware of.

Real page where that happens: http://www.metrolyrics.com/oh-girl-lyrics-various.html
The text on the page:
They say, â??Let the woman take care of youâ??
Here's another example:
Here's where I got the previous snippet: http://www.metrolyrics.com/when-will-the-good-apples-fall-lyrics-seekers.html
Let me know if you need more examples. Thanks.