mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Enable Historical Re-ingest by canonical-url to cover absent download IDs #338

Open pgulley opened 1 month ago

pgulley commented 1 month ago

This involves two changes:

  1. Update the story object to store the canonical URL that mc-metadata now produces at the extract step.

    • We also need to make sure the index step falls back to the canonical URL.
  2. Update the historical ingest process to accept input CSVs that have no download ID.
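
A rough sketch of the two changes, to make the shape concrete. Everything here is an assumption for illustration: `StoryRecord`, `index_url`, `read_rows`, and the CSV column names `url` and `download_id` are hypothetical and are not the actual story-indexer classes or the real CSV schema. The fallback order (original URL first, canonical URL when the original is missing) is one possible reading of the first item, not a settled decision.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class StoryRecord:
    """Hypothetical stand-in for the story object (not the real class)."""
    url: str
    canonical_url: Optional[str] = None  # would be filled in at the extract step
    download_id: Optional[str] = None    # may be absent in historical CSVs


def index_url(story: StoryRecord) -> Optional[str]:
    """Pick the URL for the index step.

    Assumed order: the original URL if set, otherwise the canonical URL.
    """
    return story.url or story.canonical_url


def read_rows(csv_text: str) -> Iterator[StoryRecord]:
    """Yield one record per CSV row, tolerating a missing or empty download ID."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # row.get() covers a missing column; "or None" covers an empty cell.
        download_id = (row.get("download_id") or "").strip() or None
        url = (row.get("url") or "").strip()
        yield StoryRecord(url=url, download_id=download_id)
```

With this shape, a CSV that has only a `url` column and a CSV with a blank `download_id` cell both produce records whose `download_id` is `None`, so downstream steps can branch on that single condition.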