oduwsdl / off-topic-memento-toolkit

This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.
MIT License

TimeMap (URI-T) input is resulting in JSONDecodeError and KeyError #7

Status: Closed (kritikagarg closed this issue 2 years ago)

kritikagarg commented 2 years ago

$ detect_off_topic -i timemap="http://web.archive.org/web/*/https://twitter.com/kritika_garg/" -o outputfile.json

2022-09-11 13:06:44,266 [INFO] __main__: Starting topic analysis run.
2022-09-11 13:06:44,268 [INFO] __main__: Acquiring memento colleciton using input type timemap
2022-09-11 13:06:44,268 [INFO] __main__: TimeMap measures chosen: {'cosine': 0.12}
2022-09-11 13:06:44,268 [INFO] otmt.input_types: Using input type timemap
2022-09-11 13:06:44,268 [INFO] otmt.input_types: Working directory /tmp/otmt-working will be used
2022-09-11 13:06:44,268 [INFO] otmt.collectionmodel: loading data from directory /tmp/otmt-working/timemaps
2022-09-11 13:06:44,272 [INFO] otmt.input_types: Acquiring collection model from TimeMap at [http://web.archive.org/web/*/https://twitter.com/kritika_garg/]
Traceback (most recent call last):
  File "/home/kritika/.local/lib/python3.8/site-packages/otmt/collectionmodel.py", line 248, in addTimeMap
    json_timemap = json.loads(content)
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kritika/.local/bin/detect_off_topic", line 198, in <module>
    cm = otmt.get_collection_model(
  File "/home/kritika/.local/lib/python3.8/site-packages/otmt/input_types.py", line 678, in get_collection_model
    return supported_input_types[input_type](arguments, working_directory)
  File "/home/kritika/.local/lib/python3.8/site-packages/otmt/input_types.py", line 563, in get_collection_model_from_timemap
    cm.addTimeMap(urit, content, headers)
  File "/home/kritika/.local/lib/python3.8/site-packages/otmt/collectionmodel.py", line 278, in addTimeMap
    json_timemap = convert_LinkTimeMap_to_dict(content, skipErrors=True)
  File "/home/kritika/.local/lib/python3.8/site-packages/otmt/timemap.py", line 159, in convert_LinkTimeMap_to_dict
    process_local_dict(local_dict, dict_timemap)
  File "/home/kritika/.local/lib/python3.8/site-packages/otmt/timemap.py", line 43, in process_local_dict
    relation = local_dict[uri]["rel"]
KeyError: 'rel'

https://github.com/oduwsdl/off-topic-memento-toolkit/blob/77cba436a6219fab8b8b7c773c3fc273388a8a2c/otmt/timemap.py#L43

shawnmjones commented 2 years ago

The TimeMap URI supplied to the detect_off_topic command above points to a human-readable TimeMap. OTMT cannot parse human-readable TimeMaps. It works with a URI-T linking to a Memento Protocol machine-readable TimeMap. In the case above, you'll want to use the URI-T at http://web.archive.org/web/timemap/link/https://twitter.com/kritika_garg
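For anyone hitting the same problem, here is a minimal sketch of the URI rewrite described above. It is not part of OTMT; the function name `to_link_timemap_urit` is hypothetical, and it assumes the only thing that needs to change is the Wayback Machine's `/web/*/` calendar prefix, which is swapped for `/web/timemap/link/`. Any other input is returned untouched.

```python
def to_link_timemap_urit(uri: str) -> str:
    """Rewrite a Wayback Machine calendar URI (/web/*/...) into a
    link-format TimeMap URI-T (/web/timemap/link/...).

    Hypothetical helper for illustration; not an OTMT function.
    """
    for scheme in ("https://", "http://"):
        calendar_prefix = scheme + "web.archive.org/web/*/"
        if uri.startswith(calendar_prefix):
            # Everything after the prefix is the original resource URI.
            original = uri[len(calendar_prefix):]
            return scheme + "web.archive.org/web/timemap/link/" + original
    # Not a Wayback calendar URI: leave it alone.
    return uri


if __name__ == "__main__":
    print(to_link_timemap_urit(
        "http://web.archive.org/web/*/https://twitter.com/kritika_garg/"
    ))
```

The result can then be passed to `detect_off_topic -i timemap=...` in place of the calendar URI. The sketch keeps any trailing slash from the original URI as it was.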

I agree that the error message could be much better though. OTMT essentially just blew up because you gave it unexpected input.
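One possible shape for a friendlier failure, sketched under assumptions: the names `TimeMapFormatError` and `check_link_timemap` are hypothetical and do not exist in OTMT, and the regular expression is only a rough heuristic for "at least one `<uri>; ... rel="..."` entry", not a full link-format parser. A check like this could run before the content is handed to the link-format conversion, so the user sees a hint instead of a bare `KeyError: 'rel'`.

```python
import re

# Heuristic: a link-format TimeMap contains entries such as
#   <http://example.com/>; rel="original",
LINK_ENTRY = re.compile(r'<[^>]+>\s*;[^,]*\brel="[^"]*"')


class TimeMapFormatError(ValueError):
    """Raised when content does not look like a machine-readable TimeMap."""


def check_link_timemap(content: str, urit: str) -> None:
    """Raise TimeMapFormatError if content has no link-format entries.

    Hypothetical pre-check for illustration; not an OTMT function.
    """
    if not LINK_ENTRY.search(content):
        raise TimeMapFormatError(
            "The content at {} does not look like a machine-readable "
            "TimeMap. If this is a Wayback Machine URI, try the "
            "/web/timemap/link/ form instead of /web/*/.".format(urit)
        )


if __name__ == "__main__":
    html_page = "<!DOCTYPE html><html><body>calendar</body></html>"
    try:
        check_link_timemap(html_page, "http://web.archive.org/web/*/...")
    except TimeMapFormatError as err:
        print(err)
```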

Try the suggested URI-T and let me know if it works.

kritikagarg commented 2 years ago

Thank you for the prompt reply, @shawnmjones !! The suggested URI-T works. Maybe we could also add this information to the README; that would be helpful.

shawnmjones commented 2 years ago

I totally agree. Please feel free to edit the README and commit the changes for others. 🙂