mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0
1 stars 4 forks source link

indexer.story.RawHTML.guess_encoding issues #201

Open philbudne opened 9 months ago

philbudne commented 9 months ago

NOTE! I saw this using requests to fetch URLs rather than scrapy.

Our synthetic RSS feed includes non-text documents (image, video, pdf), which I think scrapy is silently discarding, so this is NOT an issue that directly effects current production.

If RawHTML.encoding is None, referencing the RawHTML.unicode property calls RawHTML.guess_encoding()

There are several issues with guess_encoding:

  1. It uses chardet.detect, which my past testing has shown to be flat out wrong, ignoring indications (XML header lines, HTML meta tags) that indicate that the document is UTF-8 (*)
  2. guess_encoding() attempts to set self.encoding without a with clause, which throws an exception Attempting write on frozen StoryData....

Since this code path is NOT currently exercised I don't think the right thing to do is to fix it, but rather to treat the case of accessing the unicode property when RawHTML.encoding is not set as a fatal error, instead of fixing it to enable code that makes known dubious choices.

A further concern that if RawHTML.unicode is referenced inside a with clause, adding a with clause inside the unicode property would require nested with clauses to work!!

(*) There appear to be no less than three libraries for sniffing HTML document encoding:

  1. the scrapy project's w3lib.encoding (used by scrapy)
  2. The deprecated requests.util.get_encodings_from_content, which has migrated to requests_toolbelt.utils.deprecated
  3. Beautiful Soup v4's bs4.dammit UnicodeDammit (and associated EncodingDetector).

No doubt there are others!!!!

rahulbot commented 9 months ago

Prioritizing and assigned based on the guess this will need to be tackled as part of #128 (because I think it will exercise this code path).

philbudne commented 9 months ago

For my queue based fetcher, I worked around this, doing decode outside of the Story class using BS4's UnicodeDammit in the parser, using the RawHTML sub-object of Story as just a container.

rahulbot commented 8 months ago

Note: think we're unlikely to hit this, no urgent action needed.