mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

harden StoryArchiveReader??? #220

Open philbudne opened 10 months ago

philbudne commented 10 months ago

I originally wrote StoryArchiveReader (in the now-misnamed story_archive_writer.py, to keep it near the 'Writer) as a quick-and-dirty one-off for testing. It is not a general-purpose WARC reader: it assumes that the contents of the WARC (after the initial "warcinfo" record) alternate between "response" records (w/ "200 OK" status) and "metadata" records w/ content-type application/x.mediacloud-indexer+json.

At the very least, it could:

  1. document that it was a NON-goal to accept arbitrary WARC files.
  2. document what it expects
  3. check that the warcinfo record contains software: mediacloud story-indexer ArchiveWriter and give (at least) a warning if it does not???
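
For item 3, a minimal sketch of what such a check might look like, using only the standard library. It assumes the warcinfo payload has already been read as bytes and consists of "name: value" lines; the names parse_warc_fields and check_warcinfo are hypothetical and not part of story-indexer.

```python
import warnings
from typing import Dict

# Value quoted in this issue; the exact string the writer emits is an assumption.
EXPECTED_SOFTWARE = "mediacloud story-indexer ArchiveWriter"


def parse_warc_fields(payload: bytes) -> Dict[str, str]:
    """Parse "name: value" lines from a warcinfo payload (hypothetical helper)."""
    fields: Dict[str, str] = {}
    for line in payload.decode("utf-8", errors="replace").splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip blank or malformed lines that have no colon
            fields[name.strip().lower()] = value.strip()
    return fields


def check_warcinfo(payload: bytes) -> bool:
    """Return True if the file looks like ours; otherwise warn and return False."""
    software = parse_warc_fields(payload).get("software", "")
    if EXPECTED_SOFTWARE not in software:
        warnings.warn(
            f"unexpected warcinfo software field: {software!r}", stacklevel=2
        )
        return False
    return True
```

Warning rather than raising keeps the reader usable for testing against hand-made files, while still flagging input it was never meant to handle.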

Thoughts: The WARC specification isn't particularly strict or prescriptive, and handling any legal WARC file is likely an open-ended task. StoryArchiveReader is simple on purpose, and enhancing it for other uses might be a bad idea. If it's necessary to be able to read some other WARC files (of a particular format), it might be best to write a new class that accepts those WARC records, and to wrap the instantiation of a Reader into a function that looks at the "warcinfo" record and picks the right reader.
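
The wrapper idea could be sketched roughly as follows. Everything here is hypothetical: the stub classes stand in for real readers, "hypothetical-other-crawler" is a made-up marker, and the function assumes the warcinfo fields have already been parsed into a dict.

```python
import io
from typing import BinaryIO, Dict, List, Tuple, Type


class BaseReaderStub:
    """Stand-in for a reader class; real readers would iterate over records."""

    def __init__(self, stream: BinaryIO) -> None:
        self.stream = stream


class StoryArchiveReaderStub(BaseReaderStub):
    """Placeholder for the existing, deliberately simple reader."""


class OtherWarcReaderStub(BaseReaderStub):
    """Placeholder for a new class written for some other WARC layout."""


# (substring expected in the warcinfo "software" field, reader class)
READERS: List[Tuple[str, Type[BaseReaderStub]]] = [
    ("mediacloud story-indexer ArchiveWriter", StoryArchiveReaderStub),
    ("hypothetical-other-crawler", OtherWarcReaderStub),
]


def open_reader(warcinfo_fields: Dict[str, str], stream: BinaryIO) -> BaseReaderStub:
    """Pick a reader class based on the warcinfo "software" field."""
    software = warcinfo_fields.get("software", "")
    for marker, reader_cls in READERS:
        if marker in software:
            return reader_cls(stream)
    raise ValueError(f"no reader registered for software {software!r}")


reader = open_reader(
    {"software": "mediacloud story-indexer ArchiveWriter"}, io.BytesIO(b"")
)
print(type(reader).__name__)
```

With this shape, supporting another format means adding a class and a registry entry, and StoryArchiveReader itself stays untouched.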