philbudne closed this 10 months ago
In this case I think it is revealing a real bug. https://www.wdsu.com/ is a good edge case, having only an <h1> and no other title indicators. I've added the fix for that.
The https://focustaiwan.tw examples don't return a 200 for me, so I can't replicate. In my browser they render fine and have a title tag that should work fine. Does the quarantine record include the HTML associated with the error? If so, can you share it for one of the https://focustaiwan.tw ones?
More broadly, this is the wild wild web, so we're always going to have pages that we can't process because they demonstrate some edge case that breaks our metadata extraction. I agree that catching and falling back on some default could hide larger problems, in this case specifically with title parsing, especially because this library is still maturing in its current form.
Yes, full Story objects are quarantined. Here is the HTML for https://focustaiwan.tw/business/202311170015.htm
The title tag contains only whitespace characters (CR, LF, TAB).
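To make the two edge cases in this thread concrete (a whitespace-only title tag, and a page with only an <h1>), here is a minimal sketch of a title picker that treats a blank <title> as missing and falls back to the first <h1>. It uses only the standard library; `pick_title` and `_TitleCollector` are hypothetical names for illustration and are not how mcmetadata's titles.py is written.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleCollector(HTMLParser):
    """Collects the text inside the first <title> and the first <h1>."""

    def __init__(self) -> None:
        super().__init__()
        self._current: Optional[str] = None  # tag currently being collected
        self._done = set()                   # tags whose first occurrence has closed
        self.text = {"title": [], "h1": []}

    def handle_starttag(self, tag, attrs):
        # Only the first occurrence of each tag of interest is collected.
        if tag in self.text and tag not in self._done:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._done.add(tag)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.text[self._current].append(data)


def pick_title(html: str) -> Optional[str]:
    """Return the first non-blank candidate among <title> and <h1>, else None."""
    parser = _TitleCollector()
    parser.feed(html)
    parser.close()
    for tag in ("title", "h1"):
        # Collapse runs of whitespace (including CR, LF, TAB) before testing.
        candidate = " ".join("".join(parser.text[tag]).split())
        if candidate:
            return candidate
    return None
```

Returning None rather than a default string leaves the caller to decide whether a missing title is an error, which keeps the failure visible instead of hiding it.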
Seen in story-indexer parser quarantine queue three times:
With the offending line: https://github.com/mediacloud/metadata-lib/blob/main/mcmetadata/titles.py#L95
I don't have any opinion on what the right fix is.
Catching/ignoring too many exceptions will sweep bugs under the rug.
In a sense, that's why we have a quarantine queue: to catch stories that trip us up, so we can make the code more robust.
I suppose one possible answer is that the story-indexer parser needs to pull the extract functionality in-line, so it can make decisions about what can or cannot be ignored?
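One way to picture that idea is a caller that runs each field extractor separately and decides per field whether a failure is tolerable. The sketch below is hypothetical throughout: `extract_fields`, the `extractors` mapping, and `optional_fields` are invented for illustration and do not reflect how story-indexer or mcmetadata is structured.

```python
from typing import Callable, Dict, Optional, Set, Tuple


def extract_fields(
    html: str,
    extractors: Dict[str, Callable[[str], Optional[str]]],
    optional_fields: Set[str],
) -> Tuple[Dict[str, Optional[str]], Dict[str, str]]:
    """Run each extractor; tolerate failures only for fields the caller marks optional."""
    values: Dict[str, Optional[str]] = {}
    errors: Dict[str, str] = {}
    for name, extractor in extractors.items():
        try:
            values[name] = extractor(html)
        except Exception as exc:
            if name not in optional_fields:
                # Required field: re-raise so the story can still be quarantined.
                raise
            # Optional field: record the error instead of silently dropping it.
            values[name] = None
            errors[name] = repr(exc)
    return values, errors
```

Keeping the recorded errors alongside the values means ignored failures can still be counted or logged, so they are not swept under the rug.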