mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

Handling of URL parse failure #58

Closed by philbudne 1 year ago

philbudne commented 1 year ago

The rss-fetcher crashed today. An RSS "link" seems to have had leading and trailing whitespace: ' https://www.letras.com.br/banda-n-drive/eden '. The fetcher passed it to mcmetadata.urls.is_homepage_url(link), which it expects to raise only ValueError or TypeError. That call sits inside a larger try that catches any of AttributeError, KeyError, ValueError, UnicodeError. The crash happened in is_homepage_url at uri_host = uri.host.lower() because uri was None. The questions are:

This isn't pressing; it's the first time I've seen the exception.

@rahulbot responded:

I think this is reasonable cleanup for mcmetadata to include.
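
A minimal sketch of what such cleanup could look like, under stated assumptions. This is not mcmetadata's code: clean_host is a hypothetical helper, and it uses the standard library's urllib.parse rather than whatever parser is_homepage_url uses internally. The idea is to strip surrounding whitespace before parsing and to turn any "could not parse" outcome into a ValueError, which callers such as the rss-fetcher already expect.

```python
from urllib.parse import urlsplit


def clean_host(raw_url: str) -> str:
    """Hypothetical helper: return the lowercased host of raw_url.

    Raises TypeError for non-string input and ValueError when no host
    can be extracted, so callers only need to handle those two.
    """
    if not isinstance(raw_url, str):
        raise TypeError(f"expected str, got {type(raw_url).__name__}")

    # RSS "link" values can carry stray whitespace; remove it up front.
    cleaned = raw_url.strip()

    # urlsplit itself may raise ValueError (e.g. malformed IPv6 literal).
    host = urlsplit(cleaned).hostname
    if host is None:
        # A parse that yields no host is reported as ValueError rather
        # than being left to fail later as an AttributeError on None.
        raise ValueError(f"could not parse a host from {raw_url!r}")
    return host.lower()
```

With this shape, the link from the report parses to www.letras.com.br, and an unparseable string surfaces as a ValueError instead of failing on a None result.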