harden StoryArchiveReader???

I originally wrote StoryArchiveReader (in the now misnamed story_archive_writer.py, to keep it near the 'Writer) as a quick and dirty one-off for testing. It is not a general purpose WARC reader: it assumes that the contents of the WARC (after the initial "warcinfo" record) are alternating "response" records (w/ "200 OK" status) and "metadata" records w/ content-type application/x.mediacloud-indexer+json.

At the very least, it could:

document that accepting arbitrary WARC files is a NON-goal.
document what it expects.
check that the warcinfo record contains "software: mediacloud story-indexer ArchiveWriter" and give (at least) a warning if it does not.
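
A rough sketch of the warcinfo check, for discussion only. It assumes the reader already has the warcinfo record's payload as bytes (an application/warc-fields block of "name: value" lines); how that payload is obtained is left out. The names parse_warc_fields and check_warcinfo_software are made up for this sketch, and matching on a prefix (to tolerate a trailing version string) is my assumption, not something ArchiveWriter is known to need.

```python
import warnings

# Value quoted in this issue; assumed to be what ArchiveWriter puts in warcinfo.
EXPECTED_SOFTWARE = "mediacloud story-indexer ArchiveWriter"


def parse_warc_fields(payload: bytes) -> dict[str, str]:
    """Parse "name: value" lines into a dict keyed by lowercased field name."""
    fields: dict[str, str] = {}
    for line in payload.decode("utf-8", errors="replace").splitlines():
        if line[:1].isspace():
            continue  # folded continuation line; ignored in this sketch
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip().lower()] = value.strip()
    return fields


def check_warcinfo_software(payload: bytes) -> bool:
    """Return True if the payload names the expected software, else warn."""
    software = parse_warc_fields(payload).get("software", "")
    # Prefix match is an assumption, to allow a version suffix after the name.
    if software.startswith(EXPECTED_SOFTWARE):
        return True
    warnings.warn(
        f"warcinfo software is {software!r}, expected {EXPECTED_SOFTWARE!r}; "
        "this file may not be readable by StoryArchiveReader",
        stacklevel=2,
    )
    return False
```

Returning a bool and only warning keeps the reader usable on files that happen to have the right layout; a stricter variant could raise instead.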

Thoughts: The WARC specification isn't particularly strict or prescriptive, so accepting any legal WARC file is likely an open-ended task. StoryArchiveReader is simple on purpose, and enhancing it for other uses might be a bad idea. If it becomes necessary to read some other WARC files (of a particular format), it might be best to write a new class that accepts those WARC records, and to wrap the instantiation of a Reader in a function that looks at the "warcinfo" record and picks the right reader.
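
To make the "pick the right reader" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the two classes are empty stand-ins (not the real StoryArchiveReader), "hypothetical-other-writer" is an invented software value, and open_archive_reader takes the software string as an argument instead of reading the "warcinfo" record itself.

```python
from typing import Callable


class StoryArchiveReaderSketch:
    """Stand-in for the existing reader (alternating response/metadata records)."""

    def __init__(self, path: str) -> None:
        self.path = path


class OtherArchiveReaderSketch:
    """Stand-in for a separate class that accepts some other WARC layout."""

    def __init__(self, path: str) -> None:
        self.path = path


# Maps a prefix of the warcinfo "software" value to a reader class.
# The second key is invented purely for illustration.
READERS: dict[str, Callable[[str], object]] = {
    "mediacloud story-indexer ArchiveWriter": StoryArchiveReaderSketch,
    "hypothetical-other-writer": OtherArchiveReaderSketch,
}


def open_archive_reader(path: str, software: str) -> object:
    """Return a reader for path, chosen by the warcinfo software value."""
    for prefix, reader_class in READERS.items():
        if software.startswith(prefix):
            return reader_class(path)
    raise ValueError(f"no reader registered for warcinfo software {software!r}")
```

This keeps each reader simple and strict about the one layout it understands, and puts the format decision in a single place.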
