adapters.adapter_fictionalley does not deal with utf8 metadata correctly

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
$ python2.5 downloader.py -m -f html http://www.fictionalley.org/authors/worth_12_of_malfoy/resistance.html

What is the expected output? What do you see instead?

In the metadata (handily given by the debug logging) I see:

'description': u'Hogwarts has changed. Severus Snape is Headmaster, Dark Arts 
is on the curriculum, and the shadow of Voldemort\xe2\u20ac\u2122s reign of 
terror hangs heavily over the remaining students. Faced with a choice between 
hope and despair, three students determine to fight back against the new 
regime.  Neville, Ginny and Luna rally the remainder of 
Dumbledore\xe2\u20ac\u2122s Army and form a resistance movement. But the stakes 
are high and they must fight not only the school\xe2\u20ac\u2122s 
administration but their own demons as they struggle to survive in a cruel new 
Hogwarts. This is \xe2\u20ac\u02dcDeathly Hallows\xe2\u20ac\u2122 from the 
perspective of those Harry left behind, who never lost their faith that one day 
he would return, and prepared to fight alongside him for the very future of 
their world.'

What I expect to see is 

'description': u'Hogwarts has changed. Severus Snape is Headmaster, Dark Arts 
is on the curriculum, and the shadow of Voldemort\u2019s reign of terror hangs 
heavily over the remaining students. Faced with a choice between hope and 
despair, three students determine to fight back against the new regime.  
Neville, Ginny and Luna rally the remainder of Dumbledore\u2019s Army and form 
a resistance movement. But the stakes are high and they must fight not only the 
school\u2019s administration but their own demons as they struggle to survive 
in a cruel new Hogwarts. This is \u2018Deathly Hallows\u2019 from the 
perspective of those Harry left behind, who never lost their faith that one day 
he would return, and prepared to fight alongside him for the very future of 
their world.'

- specifically, the UTF-8 bytes of the curly quotes appear to have been decoded 
as cp1252 into the unicode string, rather than decoded as utf8 to start with.

What version of the product are you using? On what operating system?

hg HEAD on linux

Please provide any additional information below.

For the record, to get the properly encoded string I had to resort to:

unicode(story.metadata['description'].encode('cp1252'), 'utf8')
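The round trip above can be reproduced in isolation. The following is a minimal Python 3 sketch (the original report used Python 2.5), assuming the garbling comes from UTF-8 bytes being decoded as cp1252; the variable names are illustrative only and are not part of the downloader.

```python
# Python 3 sketch of the mojibake described in this report.
# Assumption: the page bytes were UTF-8 but were decoded as cp1252.
correct = "Voldemort\u2019s"

# Decoding UTF-8 bytes as cp1252 turns the single curly quote into three characters.
garbled = correct.encode("utf-8").decode("cp1252")
print(ascii(garbled))  # 'Voldemort\xe2\u20ac\u2122s'

# Reversing the wrong decode recovers the original bytes, which then decode as UTF-8.
# This is the Python 3 equivalent of unicode(s.encode('cp1252'), 'utf8').
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == correct)  # True
```

This only repairs text that was garbled in exactly this way; a string that was already correct and contains characters outside cp1252 would raise UnicodeEncodeError on the encode step.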

Original issue reported on code.google.com by m...@metamoof.net on 24 Aug 2011 at 3:00

GoogleCodeExporter commented 9 years ago

As a workaround, I have the following code that uses 
http://chardet.feedparser.org/ to check for strange encodings:

import chardet

def fix_ffdl_encoding(data):
    ''' Deal with utf8 encoded as a unicode object, amongst others
        Doesn't deal with lists of objects'''

    if not isinstance(data, unicode):
        return data
    try:
        endata = data.encode('cp1252') #standard windows western encoding
    except UnicodeEncodeError:
        #chances are it's utf8, or not a string
        return data
    results = chardet.detect(endata)
    if results['confidence'] > 0.8:
        return unicode(endata, results['encoding'])
    else:
        return data

Original comment by m...@metamoof.net on 24 Aug 2011 at 3:25

GoogleCodeExporter commented 9 years ago

The problem you've found is ultimately caused by the fact that fictionalley.org 
reports all its pages as utf8, even when they're really cp1252.

In fact, most of the older stories I've found there that aren't just ascii were 
cp1252.  I don't recall seeing one with true utf8 before.

Re-encoding the description only fixes the problem for this particular story, 
but won't help when the title, chapter names, story text, etc are true utf8.

I'm intrigued by the idea of using chardet or something like it to detect the 
real encoding for sites like fictionalley.org that lie to us.  

It's not necessarily going to solve all the problems, though.  I believe I once 
even saw utf8 and cp1252 on the same page.

Original comment by retiefj...@gmail.com on 14 Sep 2011 at 6:19

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

(info update)
chardet can spot utf8 with reasonable confidence in the stories I've tested.  
But it keeps incorrectly calling the windows-1252/ISO-8859-1 texts ISO-8859-2 
with 70-85% confidence.

Here's an example:
http://www.fictionalley.org/authors/aerie22/DWM01a.html

Original comment by retiefj...@gmail.com on 14 Sep 2011 at 8:06

GoogleCodeExporter commented 9 years ago

Hmmm, I think some level of intelligence is possibly useful here. I believe 
fictionalley is 100% English fics, so there are really only two encodings we 
need to worry about - utf8 and cp1252.

utf8 is fairly easy to detect, so maybe say "if chardet reckons it's 95% sure 
it's utf8, call it utf8, otherwise cp1252" and you should catch nearly all the 
edge cases there...
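A rough Python 3 sketch of that rule follows. The names pick_encoding and strict_utf8_detector are hypothetical and not part of the downloader. The detect argument is assumed to be any callable that returns a dict with 'encoding' and 'confidence' keys, which is the shape chardet.detect returns; the stand-in detector here simply checks whether the bytes are valid UTF-8 so the example runs without chardet installed.

```python
def pick_encoding(raw, detect, threshold=0.95):
    """Return 'utf8' only when the detector is confident enough, else 'cp1252'."""
    result = detect(raw) or {}
    # Normalise spellings such as 'UTF-8', 'utf_8' and 'utf8'.
    name = (result.get("encoding") or "").lower().replace("-", "").replace("_", "")
    confidence = result.get("confidence") or 0.0
    if name == "utf8" and confidence >= threshold:
        return "utf8"
    return "cp1252"


def strict_utf8_detector(raw):
    """Stand-in for chardet.detect (assumption): valid UTF-8 counts as certain."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return {"encoding": None, "confidence": 0.0}
    return {"encoding": "utf-8", "confidence": 1.0}


utf8_bytes = "Dumbledore\u2019s".encode("utf-8")
cp1252_bytes = "Dumbledore\u2019s".encode("cp1252")

print(pick_encoding(utf8_bytes, strict_utf8_detector))    # utf8
print(pick_encoding(cp1252_bytes, strict_utf8_detector))  # cp1252
```

With chardet installed, passing chardet.detect as the detect argument should work the same way, since only the 'encoding' and 'confidence' keys are read.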

Original comment by m...@metamoof.net on 14 Sep 2011 at 8:26

GoogleCodeExporter commented 9 years ago

That's just what I was thinking.  Check it against the list of encodings 
suitable for the adapter in case there are non-English sites some day.

I'm also considering making it an optional feature that can be turned on/off 
from the ini.  I like options, but I'm not sure it's useful.

Original comment by retiefj...@gmail.com on 14 Sep 2011 at 9:51

GoogleCodeExporter commented 9 years ago

I've added chardet to the system, but I don't trust it completely.  So I've made 
a user-customizable option for website encoding and added a pseudo-encoding 
'auto' that uses chardet's encoding, but only if it's 90+% confident.

It's checked into HG, but it's not the default web version yet. You can pull it 
from HG or try it out here:

http://4-0-5.fanfictionloader.appspot.com/

Be sure to add to your personal.ini (CLI) or User Configuration (web):

[www.fictionalley.org]
website_encodings: auto, Windows-1252, utf8

Please confirm if this works sufficiently for you.

Changeset: 214 (61d72dfc9f63) 
Add website_encodings option to change encoding list, add 'auto' as encoding 
type.
'auto' uses Universal Encoding Detector(http://chardet.feedparser.org/) to
derive encoding.  It spots utf-8 fairly well, but not iso8859-1/windows-1252,
so we require 90+% confidence before using it.
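
How an encoding list with an 'auto' entry might be applied can be sketched as follows. This is not the code from the changeset: decode_page and the two fake detectors are hypothetical, written in Python 3, and the detect argument is assumed to return a dict shaped like chardet.detect's result.

```python
def decode_page(raw, encodings, detect, min_confidence=0.90):
    """Try each listed encoding in order; 'auto' defers to the detector."""
    for name in encodings:
        if name == "auto":
            result = detect(raw) or {}
            guess = result.get("encoding")
            confidence = result.get("confidence") or 0.0
            if not guess or confidence < min_confidence:
                continue  # detector not confident enough, move to the next entry
            name = guess
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
    # Last resort (assumption): never fail, replace undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8"


encodings = ["auto", "Windows-1252", "utf8"]

# Fake detectors standing in for chardet.detect, for illustration only.
def unsure(raw):
    return {"encoding": "ISO-8859-2", "confidence": 0.8}

def sure_utf8(raw):
    return {"encoding": "utf-8", "confidence": 0.99}

text1, used1 = decode_page("school\u2019s".encode("cp1252"), encodings, unsure)
text2, used2 = decode_page("school\u2019s".encode("utf-8"), encodings, sure_utf8)
print(used1, used2)  # Windows-1252 utf-8
```

One limitation of this ordering: cp1252 accepts almost any byte sequence, so if 'auto' is skipped on a page that is really UTF-8, the Windows-1252 entry will usually decode it without error and produce the garbled text from the original report.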

Original comment by retiefj...@gmail.com on 15 Sep 2011 at 7:56

Changed state: Started

GoogleCodeExporter commented 9 years ago

4.0.5 now default.

Original comment by retiefj...@gmail.com on 19 Sep 2011 at 9:18

Changed state: Fixed

google-code-export / fanficdownloader

adapters.adapter_fictionalley does not deal with utf8 metadata correctly #18