mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Use "canned" data in development and staging? #193

Open philbudne opened 9 months ago

philbudne commented 9 months ago

For discussion:

Right now, development and staging deployments of story-indexer download the previous day's synthetic RSS file generated by rss-fetcher, then download 5K or 50K articles (a random selection?).

If we used fixed input RSS files (available at some static location) of different sizes, we could perform more rigorous tests after the pipeline has finished running.
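
A minimal sketch of what such a post-run check might look like, assuming the canned file is plain RSS 2.0 with one `<link>` per `<item>`, and that the set of URLs that reached the index can be obtained somehow after the run. The function names `expected_urls` and `check_run` are hypothetical and not part of story-indexer:

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Iterable


def expected_urls(rss_text: str) -> set[str]:
    """Collect the article URLs listed in a fixed (canned) RSS file."""
    root = ET.fromstring(rss_text)
    urls = set()
    for item in root.iter("item"):
        # Only <link> elements directly under <item> count as articles;
        # the channel-level <link> is ignored.
        link = item.findtext("link")
        if link and link.strip():
            urls.add(link.strip())
    return urls


def check_run(rss_text: str, indexed_urls: Iterable[str]) -> dict[str, list[str]]:
    """Compare what the canned input promised with what was indexed."""
    expected = expected_urls(rss_text)
    indexed = set(indexed_urls)
    return {
        "missing": sorted(expected - indexed),      # in the RSS, never indexed
        "unexpected": sorted(indexed - expected),   # indexed, not in the RSS
    }
```

With a fixed input, both lists should be empty (or match a known set of deliberately bad URLs) on every run, which is not something that can be asserted against a different random sample each day.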

One question is what the test RSS files should point to. We could populate the same server that delivers the RSS files with articles and, I suppose, subvert article download in some way (i.e., by using an HTTP proxy to fetch the articles) so that the articles appear to come from different domains...
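
As a rough illustration of the rewrite such a proxy would have to perform, here is a sketch that maps an article URL onto a single static server while keeping the original domain as a path component. The layout (`<mirror base>/<original host>/<original path>`) and the name `mirror_url` are assumptions made for the example, not anything that exists today:

```python
from urllib.parse import urlsplit


def mirror_url(article_url: str, mirror_base: str) -> str:
    """Map an article URL to its copy on the static test server.

    The pipeline would still see article_url (and so its domain);
    only the proxy would fetch from the returned location.
    """
    parts = urlsplit(article_url)
    if not parts.hostname:
        raise ValueError(f"no host in URL: {article_url!r}")
    path = parts.path or "/"
    # Hypothetical layout: one directory per original domain.
    target = f"{mirror_base.rstrip('/')}/{parts.hostname}{path}"
    if parts.query:
        target += "?" + parts.query
    return target
```

For example, `mirror_url("https://news.example.com/a/b.html", "http://canned.example.org/articles")` gives `http://canned.example.org/articles/news.example.com/a/b.html`.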

rahulbot commented 8 months ago

This is clearly valuable. Ongoing experiences with GitHub Actions point me in that direction for a solution (though that contributes to vendor lock-in). In any case, a first pass could use the CI to do a useful version. If I understood right, a sketch could look like this:

rahulbot commented 7 months ago

Note: #221 might be an enabler of this