Use "canned" data in development and staging?

philbudne commented 9 months ago

For discussion:

Right now, development and staging deployments of story-indexer download the previous day's synthetic RSS file generated by rss-fetcher, then fetch 5K or 50K of the articles it lists (a random selection?).

If we used fixed input RSS files (available at some static location) of different sizes, we could perform more rigorous tests after the pipeline has finished running.

One question is what the test RSS files should point to. We could populate the same server that delivers the RSS files with articles and, I suppose, subvert article download in some way (i.e., by using an HTTP proxy to fetch the articles) so that the articles appear to come from different domains...
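To make the proxy idea concrete, here is a minimal sketch using only the Python standard library. It assumes the fetcher can be pointed at a plain-HTTP proxy and that canned articles are stored on disk by hostname; the directory name `canned-articles`, the file layout, and the names `canned_path_for`, `CannedProxyHandler`, and `serve` are all hypothetical, not part of story-indexer. It would not handle `https://` URLs, which a proxy client tunnels with CONNECT.

```python
import hashlib
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path
from urllib.parse import urlsplit

# Hypothetical directory holding the saved HTML articles.
CANNED_ROOT = Path("canned-articles")


def canned_path_for(url: str, root: Path = CANNED_ROOT) -> Path:
    """Map a requested article URL to a file under root.

    Assumed layout: <root>/<hostname>/<sha1 of path and query>.html
    """
    parts = urlsplit(url)
    host = (parts.hostname or "unknown").lower()
    key = parts.path or "/"
    if parts.query:
        key += "?" + parts.query
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return root / host / f"{digest}.html"


class CannedProxyHandler(BaseHTTPRequestHandler):
    """Answers plain-HTTP proxy requests from local files; never goes upstream."""

    def do_GET(self) -> None:
        # A client configured with an HTTP proxy sends the absolute URL in
        # the request line, so self.path looks like "http://example.com/a".
        target = canned_path_for(self.path)
        if not target.is_file():
            self.send_error(404, "no canned copy of this URL")
            return
        body = target.read_bytes()
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8888) -> None:
    """Run the canned proxy on localhost until interrupted."""
    ThreadingHTTPServer(("127.0.0.1", port), CannedProxyHandler).serve_forever()
```

Because the lookup key includes the hostname, the same server can stand in for any number of apparent source domains, as long as the test RSS files use `http://` links.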

rahulbot commented 8 months ago

This is clearly valuable. My ongoing experience with GitHub Actions points me in that direction for a solution (though it contributes to vendor lock-in). In any case, a first pass could use the CI to do a useful version. If I understood right, a sketch could look like this:

- include test folder in repo with static set of HTML news story files, and RSS file pointing to them
- have CI action launch proxy server from web server container image and use that folder as a mount
- have CI action start up queue and ES server from container images
- launch test script that processes the static RSS file (pointing at proxy address) and validates success by counting things after fixed delay (like number of stories in ES index)
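The validation step at the end of that list could be sketched roughly as follows. This version polls with a timeout instead of sleeping for one fixed delay, which is a substitution on my part, and it reads the document count from Elasticsearch's `_count` REST endpoint with the standard library. The function names, the ES address, and the index name in the usage note are assumptions for illustration only.

```python
import json
import time
import urllib.request
from typing import Callable


def es_count(base_url: str, index: str) -> int:
    """Return the number of documents in an index via the ES _count API."""
    with urllib.request.urlopen(f"{base_url}/{index}/_count") as resp:
        return int(json.load(resp)["count"])


def wait_for_count(
    get_count: Callable[[], int],
    expected: int,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll get_count until it reaches expected or timeout seconds elapse.

    Returns the last observed count; the caller decides pass or fail.
    """
    deadline = time.monotonic() + timeout
    count = get_count()
    while count < expected and time.monotonic() < deadline:
        sleep(interval)
        count = get_count()
    return count
```

A CI test script might then call `wait_for_count(lambda: es_count("http://localhost:9200", "test_stories"), expected=N)` and fail unless the result equals `N`, the number of stories in the static RSS file. Comparing for equality rather than "at least" would also catch duplicate indexing.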

rahulbot commented 7 months ago

Note: #221 might be an enabler of this

mediacloud / story-indexer

Use "canned" data in development and staging? #193