mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

fix/csv queuer fetcher #288

Closed: thepsalmist closed this 6 months ago

thepsalmist commented 6 months ago

The PR builds on #271 to fetch URLs from CSV files on S3 for Database E dates 01/25-02/17.
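
Roughly, the queuer's job looks like the following minimal sketch (not the actual csv-queuer.py): read a CSV object from S3 and hand each URL to a queuing callback. The bucket/key arguments, the "url" column name, and the queue_story() callback are placeholders for this example, not the real story-indexer API.

```python
# Minimal sketch only: fetch a CSV from S3 and queue each non-empty URL.
import csv
import io
from typing import Callable

import boto3


def queue_urls_from_csv(bucket: str, key: str,
                        queue_story: Callable[[str], None],
                        url_column: str = "url") -> int:
    """Fetch s3://bucket/key, parse it as CSV, and queue every non-empty URL."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    queued = 0
    for row in csv.DictReader(io.StringIO(body)):
        url = (row.get(url_column) or "").strip()
        if url:
            queue_story(url)  # hand off to the queue-based fetcher
            queued += 1
    return queued
```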

philbudne commented 6 months ago

Some preliminary comments:

  1. The branch seems to contain other stuff (HIST_YEAR=2023 and adding ELASTICSEARCH_SNAPSHOT_REPO)? It's best practice to base PRs on main.
  2. I think the pipeline type "csv-fetcher" is misleading; I suggest using "csv": it runs a different queuer that feeds into the (regular) queue-based fetcher, and it's likely to be (re)used in the future rather than used only once.
  3. In csv-queuer:
    • I don't think any of the commented-out code in csv-queuer.py needs to be kept around.
    • the comment "# let hist-fetcher quarantine if bad" should go away
    • the large block comment on urls_seen is no longer meaningful (since the CSV file is unlikely to have come from the legacy system), BUT filtering out duplicate URLs is probably still a good idea (it doesn't cost much, and can save time/effort); see the sketch after this list.
    • the comment block starting "# content_metadata.parsed_date is not set, so parser.py will" can go away
    • rss.source_feed_id = None and rss.source_source_id = None can go away (None should be the default value)
    • rss.source_url = url can go away: we don't have the URL of an RSS file the story was found in.
  4. In deploy.sh:
    • for PIPE_TYPE_PFX='hist-' I suggest using 'csv-', and setting PORT_BIAS=600 to allow co-existence with other stacks
    • it would be good to set "ARCH_SUFFIX=csv" so the generated WARC files are distinct from the current-day ones
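
On the urls_seen point, the duplicate filtering doesn't need to be anything fancy; something along these lines would do (a sketch only, not what csv-queuer.py currently has; the normalization rules here are just an example):

```python
# Sketch of per-run duplicate-URL filtering for the CSV queuer; the
# normalization (lowercase scheme/host, drop fragment) is illustrative only.
from urllib.parse import urlsplit, urlunsplit


class URLDeduper:
    """Remember URLs already queued and skip repeats within a single run."""

    def __init__(self) -> None:
        self.urls_seen: set[str] = set()

    def _normalize(self, url: str) -> str:
        parts = urlsplit(url.strip())
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path, parts.query, ""))

    def should_queue(self, url: str) -> bool:
        key = self._normalize(url)
        if key in self.urls_seen:
            return False
        self.urls_seen.add(key)
        return True
```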