Enable Historical Re-ingest by canonical-url to cover absent download IDs

mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.

https://mediacloud.org

Apache License 2.0


Open pgulley opened 1 month ago

pgulley commented 1 month ago

This involves two changes:

- Update the story object to store the canonical-url that mc-metadata now produces at the extract step.
  - We also need to make sure we fall back to the canonical URL at the index step.
- Update the historical ingest process to accept input CSVs that have no download-id.
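
A minimal sketch of how the two changes might fit together, not the story-indexer's actual code. Everything named here is an assumption for illustration: the `HistoricalRow` class, the `read_rows` and `index_url` helpers, and the CSV column names `url` and `download_id`. The fallback shown (use the canonical URL only when no original URL is available) is one possible reading of the index-step change.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class HistoricalRow:
    """Hypothetical shape of one input row; not the project's real class."""
    url: str
    download_id: Optional[str]  # None when the CSV has no usable value


def read_rows(csv_text: str) -> Iterator[HistoricalRow]:
    """Yield rows from CSV text, tolerating a missing or empty download_id.

    The column names "url" and "download_id" are assumed for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        url = (row.get("url") or "").strip()
        if not url:
            continue  # nothing to fetch or index without a URL
        # Covers both an absent column and an empty cell.
        download_id = (row.get("download_id") or "").strip() or None
        yield HistoricalRow(url=url, download_id=download_id)


def index_url(original_url: Optional[str],
              canonical_url: Optional[str]) -> Optional[str]:
    """Pick the URL to index: the original if present, else the canonical one."""
    return original_url or canonical_url
```

With this shape, a CSV that has only a `url` column yields rows whose `download_id` is `None`, so later stages can branch on that value rather than failing at parse time.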