As a workaround, I have the following code that uses
http://chardet.feedparser.org/ to check for strange encodings:

    import chardet

    def fix_ffdl_encoding(data):
        '''Deal with utf8 encoded as a unicode object, amongst others.
        Doesn't deal with lists of objects.'''
        if not isinstance(data, unicode):
            return data
        try:
            # cp1252 is the standard Windows western encoding
            endata = data.encode('cp1252')
        except UnicodeEncodeError:
            # chances are it's utf8, or not a string
            return data
        results = chardet.detect(endata)
        if results['confidence'] > 0.8:
            return unicode(endata, results['encoding'])
        else:
            return data
Original comment by m...@metamoof.net
on 24 Aug 2011 at 3:25
The problem you've found is ultimately caused by the fact that fictionalley.org
reports all its pages as utf8, even when they're really cp1252.
In fact, most of the older stories I've found there that aren't just ascii were
cp1252. I don't recall seeing one with true utf8 before.
Re-encoding the description only fixes the problem for this particular story,
but won't help when the title, chapter names, story text, etc are true utf8.
I'm intrigued by the idea of using chardet or something like it to detect the
real encoding for sites like fictionalley.org that lie to us.
It's not necessarily going to solve all the problems, though. I believe I once
even saw utf8 and cp1252 on the same page.
Original comment by retiefj...@gmail.com
on 14 Sep 2011 at 6:19
(info update)
chardet can spot utf8 with reasonable confidence in the stories I've tested.
But it keeps incorrectly calling the windows-1252/ISO-8859-1 texts ISO-8859-2
with 70-85% confidence.
Here's an example:
http://www.fictionalley.org/authors/aerie22/DWM01a.html
Original comment by retiefj...@gmail.com
on 14 Sep 2011 at 8:06
Hmmm, I think some level of intelligence is possibly useful here. I believe
fictionalley is 100% English fics, so there are really only two encodings we
need to worry about: utf8 and cp1252.
utf8 is fairly easy to detect, so maybe say "if chardet reckons it's 95% sure
it's utf8, call it utf8, otherwise cp1252" and you should catch nearly all the
edge cases there...
Original comment by m...@metamoof.net
on 14 Sep 2011 at 8:26
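The heuristic suggested above -- trust chardet only when it is highly confident the text is utf8, and otherwise fall back to cp1252 -- could be sketched roughly like this. The function name and threshold default are illustrative; a real implementation would obtain the result dict from chardet.detect():

```python
def choose_encoding(detection, utf8_threshold=0.95):
    """Pick utf8 only when the detector is very confident; else cp1252.

    `detection` is a chardet-style result, e.g.
    {'encoding': 'utf-8', 'confidence': 0.99}.
    This is an illustrative sketch, not the project's actual code.
    """
    # Normalize the reported name so 'UTF-8', 'utf-8', 'utf8' all match.
    enc = (detection.get('encoding') or '').lower().replace('-', '')
    if enc == 'utf8' and detection.get('confidence', 0.0) >= utf8_threshold:
        return 'utf-8'
    # Anything else (including chardet's low-confidence ISO-8859-2
    # misreads mentioned earlier) is treated as the Windows western default.
    return 'cp1252'
```

With this rule, the 70-85%-confidence ISO-8859-2 guesses from the earlier comment would fall through to cp1252, which matches what those pages actually are.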
That's just what I was thinking. Check it against the list of encodings
suitable for the adapter in case there are non-English sites some day.
I'm also considering making it an optional feature that can be turned on/off
from the ini. I like options, but I'm not sure it's useful.
Original comment by retiefj...@gmail.com
on 14 Sep 2011 at 9:51
I've added chardet to the system, but I don't trust it completely. So I've
made a user-customizable option for website encoding and added a
pseudo-encoding 'auto' that uses chardet's result, but only if it's 90+%
confident.
It's checked into HG, but it's not the default web version yet. You can pull
it from HG or try it out here:
http://4-0-5.fanfictionloader.appspot.com/
Be sure to add to your personal.ini (CLI) or User Configuration (web):

    [www.fictionalley.org]
    website_encodings: auto, Windows-1252, utf8
Please confirm if this works sufficiently for you.
Changeset: 214 (61d72dfc9f63)
Add website_encodings option to change encoding list, add 'auto' as encoding
type.
'auto' uses the Universal Encoding Detector (http://chardet.feedparser.org/)
to derive the encoding. It spots utf-8 fairly well, but not
iso8859-1/windows-1252, so we require 90+% confidence before using it.
Original comment by retiefj...@gmail.com
on 15 Sep 2011 at 7:56
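A minimal sketch of how the website_encodings list with an 'auto' entry might behave: try each encoding in order, and for 'auto' accept the detector's answer only above a confidence floor. The function and the `detect` callback are hypothetical stand-ins (a real build would pass chardet.detect); this is not the actual changeset 214 code:

```python
def decode_with_encodings(raw, encodings, detect=None, min_confidence=0.90):
    """Decode `raw` bytes using the first workable entry in `encodings`.

    'auto' consults a chardet-style callback `detect(bytes) -> dict` and
    is skipped unless confidence >= min_confidence.  Hypothetical sketch.
    """
    for enc in encodings:
        if enc == 'auto':
            if detect is None:
                continue
            result = detect(raw)
            enc = result.get('encoding')
            if not enc or result.get('confidence', 0.0) < min_confidence:
                continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode with replacement characters so we always
    # return some text rather than crashing on a bad page.
    fallback = encodings[-1] if encodings[-1] != 'auto' else 'cp1252'
    return raw.decode(fallback, errors='replace')
```

For example, with the ini setting above (`auto, Windows-1252, utf8`), a page whose bytes chardet confidently calls utf-8 decodes as utf-8, while a low-confidence page falls through to Windows-1252.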
4.0.5 now default.
Original comment by retiefj...@gmail.com
on 19 Sep 2011 at 9:18
Original issue reported on code.google.com by
m...@metamoof.net
on 24 Aug 2011 at 3:00