The Media Cloud server wasn't performing well, so we made this quick-and-dirty backup project. It gets a prefilled list of the RSS feeds MC usually scrapes each day (~130k). Throughout the day it tries to fetch those feeds, and every night it generates a synthetic RSS feed with all the URLs it found.
Files are available afterwards at http://my.server/rss/mc-YYYY-MM-dd.rss.gz.
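If you want to consume one of these nightly files, a minimal sketch using only the Python standard library might look like the following. It assumes a standard RSS 2.0 layout with <item>/<link> elements; the host name is the placeholder from above and the date is just an example.

    import gzip
    import urllib.request
    import xml.etree.ElementTree as ET

    # Placeholder host and an example date; substitute real values.
    url = "http://my.server/rss/mc-2024-01-01.rss.gz"

    with urllib.request.urlopen(url) as resp:
        xml_bytes = gzip.decompress(resp.read())

    root = ET.fromstring(xml_bytes)
    for item in root.iter("item"):
        # Print each story's publication date (if present) and URL.
        print(item.findtext("pubDate"), item.findtext("link"))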
See documentation in doc/ for more details.
For development using dokku, see doc/deployment.md
For development directly on your local machine:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
sudo -u postgres createuser -s MYUSERNAME
createdb rss-fetcher
alembic upgrade head
(this initializes the database)

cp .env.template .env
(little or no editing should be needed)

BOTH should be run before merging to main (or submitting a pull request).
All config parameters should be fetched via fetcher/config.py and added to .env.template
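As an illustration only (not the actual fetcher/config.py API), the pattern is to read each parameter from the environment, which is populated from .env, in one central place; the parameter name and helper below are hypothetical.

    import os

    # Hypothetical parameter, for illustration only; real names live in
    # fetcher/config.py and .env.template.
    def example_fetch_interval_mins(default: int = 30) -> int:
        return int(os.environ.get("EXAMPLE_FETCH_INTERVAL_MINS", default))

    if __name__ == "__main__":
        print("fetch interval (mins):", example_fetch_interval_mins())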
Various scripts run each separate component:
python -m scripts.import_feeds my-feeds.csv
: Use this to import from a CSV dump of feeds (a one-time operation)

run-fetch-rss-feeds.sh
: Start the fetcher (leader and worker processes)

run-server.sh
: Run the API server

run-gen-daily-story-rss.sh
: Generate the daily files of URLs found on each day (run nightly; see the sketch at the end of this section)

python -m scripts.db_archive
: Archive and trim the fetch_events and stories tables (run nightly)

See doc/deployment.md and dokku-scripts/README.md for procedures and scripts.
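To make the nightly output concrete, here is a rough sketch of the idea behind the daily file generation (not the actual run-gen-daily-story-rss.sh or scripts code): wrap the day's story URLs in a minimal RSS 2.0 document and gzip it.

    import gzip
    import xml.etree.ElementTree as ET

    def write_daily_rss(urls, out_path):
        # Build a minimal RSS 2.0 document with one <item> per URL.
        rss = ET.Element("rss", version="2.0")
        channel = ET.SubElement(rss, "channel")
        ET.SubElement(channel, "title").text = "Media Cloud story URLs"
        for u in urls:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "link").text = u
        data = ET.tostring(rss, encoding="utf-8", xml_declaration=True)
        with gzip.open(out_path, "wb") as f:
            f.write(data)

    # Example: one day's URLs written to a file named like the pattern above.
    write_daily_rss(["http://example.com/a-story"], "mc-2024-01-01.rss.gz")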