mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

Crash because uri.query.params['nk'] can be None #45

Closed vbanos closed 9 months ago

vbanos commented 1 year ago

https://github.com/mediacloud/metadata-lib/blob/57ecff0f9c4b41bad88bf541935ebf0a984d67d3/mcmetadata/urls.py#L176

Traceback (most recent call last):
  File "index-from-cdx.py", line 194, in <module>
    po(rec)
  File "index-from-cdx.py", line 154, in __call__
    metadata = extract_metadata(cdx_record)
  File "index-from-cdx.py", line 97, in extract_metadata
    metadata = extract(url=url, html_text=res.data.decode('utf-8'))
  File "/opt/wayback-search-venv/lib/python3.8/site-packages/mcmetadata/__init__.py", line 59, in extract
    normalized_url = urls.normalize_url(final_url)
  File "/opt/wayback-search-venv/lib/python3.8/site-packages/mcmetadata/urls.py", line 247, in normalize_url
    url = _remove_query_params(url)
  File "/opt/wayback-search-venv/lib/python3.8/site-packages/mcmetadata/urls.py", line 177, in _remove_query_params
    for nk_value in uri.query.params['nk']:
TypeError: 'NoneType' object is not iterable
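
The failure mode in the traceback above can be reproduced without the library. The following is a minimal sketch, not the project's actual fix: it assumes the parsed query parameters behave like a mapping whose value for `nk` may be `None`, and the helper name `iter_param_values` is hypothetical.

```python
from typing import Dict, List, Optional, Union

# Assumed shape of a parsed query string: a key may map to None,
# a single string, or a list of strings.
ParamValue = Optional[Union[str, List[str]]]


def iter_param_values(params: Dict[str, ParamValue], key: str) -> List[str]:
    """Return the values stored under `key` as a list.

    A missing key or a None value yields an empty list, so callers
    can loop over the result without a TypeError.
    """
    value = params.get(key)
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


# Hypothetical stand-in for uri.query.params when 'nk' has no value.
params_with_none: Dict[str, ParamValue] = {"nk": None}

# Iterating the raw value raises the same TypeError as in the traceback.
try:
    for _ in params_with_none["nk"]:  # type: ignore[union-attr]
        pass
    raised = False
except TypeError:
    raised = True

# The guarded helper simply yields nothing for the None case.
safe_values = iter_param_values(params_with_none, "nk")
```

With a guard of this kind, the loop body is skipped when the value is `None` instead of crashing; whether that matches the behaviour the maintainers want for such URLs is a separate question.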

pgulley commented 1 year ago

Do you have an example of a URL that causes this error in the wild? I would like to add a test case.

rahulbot commented 9 months ago

Closing because we don't have a strong test case, and #46 did some work on this.