mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

mcmetadata.extract throwing AttributeErrors #73

Closed philbudne closed 10 months ago

philbudne commented 10 months ago

Seen in story-indexer parser quarantine queue three times:

(venv) pbudne@ramos:~/story-indexer$ egrep 'final_url|x-mc' *.warc
    "final_url": "https://focustaiwan.tw/business/202311170015",
    "x-mc-who": "parser",
    "x-mc-when": "1702553283.220337",
    "x-mc-what": "AttributeError(\"'NoneType' object has no attribute 'strip'\")",
    "x-mc-where": "/usr/local/lib/python3.10/site-packages/mcmetadata/titles.py:95",
    "x-mc-name": "from_html"

    "final_url": "https://focustaiwan.tw/business/202311170015",
    "x-mc-who": "parser",
    "x-mc-when": "1702553328.8383992",
    "x-mc-what": "AttributeError(\"'NoneType' object has no attribute 'strip'\")",
    "x-mc-where": "/usr/local/lib/python3.10/site-packages/mcmetadata/titles.py:95",
    "x-mc-name": "from_html"

    "final_url": "https://www.wdsu.com/article/untitled-content-1701813119/46044845",
    "x-mc-who": "parser",
    "x-mc-when": "1702554682.876035",
    "x-mc-what": "AttributeError(\"'NoneType' object has no attribute 'strip'\")",
    "x-mc-where": "/usr/local/lib/python3.10/site-packages/mcmetadata/titles.py:95",
    "x-mc-name": "from_html"

With the offending line: https://github.com/mediacloud/metadata-lib/blob/main/mcmetadata/titles.py#L95

        if (len(h1_title) > SHORT_TITLE_THRESHOLD) and (h1_title in title.strip()):
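For reference, the failure mode can be reproduced in isolation. This is a minimal sketch, not the library's code: the helper name `h1_matches_title` is hypothetical, and the value of `SHORT_TITLE_THRESHOLD` is an assumption (the real constant is defined in mcmetadata). It only shows that the condition raises when `title` is `None`.

```python
# Assumed value for illustration; the real constant lives in mcmetadata.
SHORT_TITLE_THRESHOLD = 10


def h1_matches_title(h1_title, title):
    # Same shape as the condition quoted above (hypothetical wrapper).
    return (len(h1_title) > SHORT_TITLE_THRESHOLD) and (h1_title in title.strip())


error_message = None
try:
    # A long <h1> combined with a missing (None) title triggers the error.
    h1_matches_title("A headline long enough to pass the threshold", None)
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

Because `and` short-circuits, a page with a short `<h1>` never reaches `title.strip()`, so only pages with a long `<h1>` and no usable title would hit this.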

I don't have any opinion on what the right fix is.

Catching/ignoring too many exceptions will sweep bugs under the rug.

In a sense, that's why we have a quarantine queue: to catch stories that trip us up, so we can make the code more robust.

I suppose one possible answer is that the story-indexer parser needs to pull the extract functionality in-line, so it can make decisions about what can or cannot be ignored?
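To illustrate the quarantine idea, here is a rough sketch of how an exception could be summarized into fields shaped like the `x-mc-*` entries above. The helper `quarantine_headers` and the stand-in `from_html` are hypothetical; this is not the story-indexer implementation, only one way such a record might be built from a traceback.

```python
import time
import traceback


def quarantine_headers(exc, who):
    """Hypothetical helper: describe an exception using x-mc-* style fields."""
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1]  # innermost frame, where the exception was raised
    return {
        "x-mc-who": who,
        "x-mc-when": str(time.time()),
        "x-mc-what": repr(exc),
        "x-mc-where": f"{last.filename}:{last.lineno}",
        "x-mc-name": last.name,
    }


def from_html(title):
    # Stand-in for the failing call; not the real mcmetadata function.
    return title.strip()


try:
    from_html(None)
except Exception as exc:
    # Record what went wrong instead of silently ignoring it.
    headers = quarantine_headers(exc, "parser")

print(headers["x-mc-what"], headers["x-mc-name"])
```

The point of the design is that the exception is neither swallowed nor fatal: the story is set aside with enough context to find the offending line later.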

rahulbot commented 10 months ago

In this case I think it is revealing a real bug. The https://www.wdsu.com/ page is a good edge case: it has only an <h1> and no other title indicators. I've added the fix for that.
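The fix itself isn't shown in this thread. As a sketch of the general shape such a guard could take, assuming a hypothetical `pick_title` helper and an assumed threshold value, a missing title can be normalized before the comparison. What to return when the condition holds is also an assumption here (this sketch returns the `<h1>` text).

```python
# Assumed value for illustration; the real constant lives in mcmetadata.
SHORT_TITLE_THRESHOLD = 10


def pick_title(title, h1_title):
    """Hypothetical None-safe variant of the title/h1 comparison."""
    title = (title or "").strip()
    h1_title = (h1_title or "").strip()
    if not title:
        # No usable title: fall back to the <h1> text, or None if that is empty too.
        return h1_title or None
    if len(h1_title) > SHORT_TITLE_THRESHOLD and h1_title in title:
        return h1_title
    return title


# A page with only an <h1> no longer raises AttributeError.
print(pick_title(None, "Untitled content headline"))
```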

The https://focustaiwan.tw examples aren't returning 200 for me, so I can't replicate. In my browser they render fine and have a title tag that should work. Does the quarantine record include the HTML associated with the error? If so, can you share it for one of the https://focustaiwan.tw ones?

More broadly, this is the wild wild web, so we're always going to have pages that we can't process because they demonstrate some edge case that breaks our metadata extraction. I agree that catching and falling back on some default could hide larger problems, in this case specifically with title parsing, especially because this library is still maturing in its current form.

philbudne commented 10 months ago

Yes, full Story objects are quarantined. Here is the HTML for https://focustaiwan.tw/business/202311170015.htm

philbudne commented 10 months ago

The title tag contains only whitespace characters (CR, LF, TAB).
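A small standalone check of that case, using only the standard library. How mcmetadata ends up with `None` for such a tag isn't shown in this thread; the `TitleGrabber` class and `title_or_none` function below are hypothetical and simply treat a whitespace-only title as missing.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Hypothetical helper: collect the text inside <title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.chunks.append(data)


def title_or_none(html):
    parser = TitleGrabber()
    parser.feed(html)
    parser.close()
    # CR, LF and TAB are all removed by strip(); an empty result means "no title".
    text = "".join(parser.chunks).strip()
    return text or None


page = "<html><head><title>\r\n\t</title></head><body><h1>Headline</h1></body></html>"
print(title_or_none(page))
```

Any caller that later does `title.strip()` on such a result needs to handle `None`, which matches the error seen in the quarantine records.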