mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Final(?) tweaks for processing legacy (CSV) stories for 2023 #234

Closed: philbudne closed this 9 months ago

philbudne commented 10 months ago

NOTE! Adds RSSEntry fields to preserve feed/source id data in CSV files.

PLEASE comment on the question at https://github.com/philbudne/story-indexer/blob/hist-update/indexer/story.py#L119 (file_name field)

Fetcher will also accept feed_url, feed_id, source_id from RSS file <source> tag.

Added --rss-file option to fetcher, to test parsing new RSS files not yet in production.

Set RSSEntry.fetch_date to the date in the name of the CSV file.

Cleaned up version picking (not needed for 2023) to use date from CSV file column instead of fetch_date.

hist-queuer: let hist-fetcher validate the downloads_id

hist-fetcher: quarantine stories with invalid downloads_id

parser: drop messages with no HTML (count only, no quarantine)
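
As a rough illustration of the fetch_date change described above, here is a minimal sketch of pulling a date out of a CSV file name. The file-name layout (a YYYY-MM-DD or YYYY_MM_DD run somewhere in the name), the function name `fetch_date_from_csv_name`, and the ISO-format string it returns are all assumptions made for illustration; none of them are taken from the story-indexer code.

```python
import re
from datetime import date
from typing import Optional

# Assumed layout: the file name contains a YYYY-MM-DD or YYYY_MM_DD run.
# The real legacy CSV naming scheme may differ.
_DATE_RE = re.compile(r"(\d{4})[-_](\d{2})[-_](\d{2})")


def fetch_date_from_csv_name(file_name: str) -> Optional[str]:
    """Return the date embedded in file_name as YYYY-MM-DD, or None.

    Hypothetical helper: returns None when no date-like run is found
    or when the digits do not form a real calendar date.
    """
    match = _DATE_RE.search(file_name)
    if match is None:
        return None
    year, month, day = match.groups()
    try:
        # Constructing a date validates month and day ranges.
        date(int(year), int(month), int(day))
    except ValueError:
        return None
    return f"{year}-{month}-{day}"
```

Returning None rather than raising lets a caller decide whether a file with no usable date should be skipped or treated as an error.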

pgulley commented 10 months ago

Maybe system_source_name or mc_source_name plus a comment, as it's carrying the source of that record within the larger mc system?

philbudne commented 9 months ago

@thepsalmist any review comments?

thepsalmist commented 9 months ago

> @thepsalmist any review comments?

LGTM!