Closed thepsalmist closed 6 months ago
Some preliminary comments:
- `HIST_YEAR=2023` and add `ELASTICSEARCH_SNAPSHOT_REPO`)? It's best practice to have PRs based on main.
- `# let hist-fetcher quarantine if bad` should go away.
- `urls_seen` is no longer meaningful (since the CSV file is unlikely to have come from the legacy system), BUT filtering for duplicate URLs is probably still a good idea (doesn't cost much, and can save time/effort).
- `# content_metadata.parsed_date is not set, so parser.py will` can go away.
- `rss.source_feed_id = None` and `rss.source_source_id = None` can go away (should be the default value).
- `rss.source_url = url` can go away: we don't have the URL of the RSS file the story was found in.
- `PIPE_TYPE_PFX='hist-'`: I suggest using `csv-` and setting `PORT_BIAS=600` to allow co-existence with other stacks.
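
To illustrate the duplicate-URL point, here is a minimal sketch of the kind of filtering I have in mind. It is not the code in this PR: the helper name `unique_urls` and the `url` column name are assumptions made for the example, and an in-memory string stands in for a CSV file read from S3.

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def unique_urls(rows: Iterable[Dict[str, str]], url_field: str = "url") -> Iterator[str]:
    """Yield each non-empty URL once, in first-seen order.

    `url_field` is a hypothetical column name; adjust it to the real CSV header.
    """
    urls_seen = set()
    for row in rows:
        url = (row.get(url_field) or "").strip()
        if not url or url in urls_seen:
            continue  # skip blank cells and URLs already yielded
        urls_seen.add(url)
        yield url


# In-memory CSV standing in for a file fetched from S3 (illustrative data only).
sample = (
    "url\n"
    "https://example.com/a\n"
    "https://example.com/b\n"
    "https://example.com/a\n"
)
deduped = list(unique_urls(csv.DictReader(io.StringIO(sample))))
print(deduped)
```

The set only grows with the number of distinct URLs in one file, so the check stays cheap compared with queueing and fetching the same story twice.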
This PR builds on #271 to fetch URLs for CSV files on S3 for Database E dates 01/25-02/17.