Open philbudne opened 9 months ago
Prioritizing and assigned based on the guess this will need to be tackled as part of #128 (because I think it will exercise this code path).
For my queue based fetcher, I worked around this, doing decode outside of the Story class using BS4's UnicodeDammit in the parser, using the RawHTML sub-object of Story as just a container.
Note: think we're unlikely to hit this, no urgent action needed.
NOTE! I saw this using
requests
to fetch URLs rather than scrapy.Our synthetic RSS feed includes non-text documents (image, video, pdf), which I think scrapy is silently discarding, so this is NOT an issue that directly effects current production.
If
RawHTML.encoding
isNone
, referencing theRawHTML.unicode
property callsRawHTML.guess_encoding()
There are several issues with guess_encoding:
chardet.detect
, which my past testing has shown to be flat out wrong, ignoring indications (XML header lines, HTML meta tags) that indicate that the document is UTF-8 (*)guess_encoding()
attempts to setself.encoding
without awith
clause, which throws an exceptionAttempting write on frozen StoryData....
Since this code path is NOT currently exercised I don't think the right thing to do is to fix it, but rather to treat the case of accessing the
unicode
property whenRawHTML.encoding
is not set as a fatal error, instead of fixing it to enable code that makes known dubious choices.A further concern that if RawHTML.unicode is referenced inside a
with
clause, adding awith
clause inside theunicode
property would require nestedwith
clauses to work!!(*) There appear to be no less than three libraries for sniffing HTML document encoding:
No doubt there are others!!!!