pgh-public-meetings / city-scrapers-pitt

Pittsburgh City Scrapers: sourcing public meetings in Pittsburgh
https://pgh-public-meetings.github.io/events/
MIT License

Putting some pieces into place to facilitate local integration testing #184

Closed · will-snavely closed this 3 years ago

will-snavely commented 3 years ago

I worked on this last night and wanted to get some feedback on the idea (see https://github.com/pgh-public-meetings/city-scrapers-pitt/issues/183). Basically, I wanted to make it easier to integration-test the full app locally, since this was useful for me while debugging some recent production scraper issues. The basic changes are:

  1. Creating a dev scrapy configuration that more closely mirrors what happens in production.
  2. Extending operations from city_scrapers_core to work on spider results stored in a local directory. The idea here is to allow basic integration testing without requiring an S3 bucket. This entails:
    • Configuring scrapy to output files to a local directory (see city_scrapers/settings/dev.py; basically we set FEED_OUTPUT_DIRECTORY to a local directory to make this work). The current proposal is to store spider results, by default, in a directory called output at the repository root (which is .gitignored); a rough sketch of such a settings module follows this list.
    • Tweaking the combinefeeds command from the city_scrapers_core module to work with a local filesystem, instead of just S3/Azure.
    • Adding a simple HTTP server for serving the spider output folder (see the second sketch after this list).
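
For concreteness, here is a rough sketch of what such a dev settings module could look like. FEED_OUTPUT_DIRECTORY is the setting described above; the base-settings import and the path handling are illustrative assumptions, not the exact contents of the file:

```python
# city_scrapers/settings/dev.py -- illustrative sketch, not the file as committed
import os

from .base import *  # noqa: F401,F403 -- assumes shared settings live in a base module

# Write spider results to a local "output" directory at the repository root
# (which is .gitignored) instead of an S3 bucket.
FEED_OUTPUT_DIRECTORY = os.path.join(os.getcwd(), "output")
```

With results on disk, the tweaked combinefeeds command can then read from this directory instead of a bucket.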
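
And a sketch of the kind of simple HTTP server the last item describes, using only the standard library; the directory and port here are assumptions for illustration:

```python
# serve_output.py -- hypothetical helper illustrating the idea
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

OUTPUT_DIR = "output"  # assumed to match the local feed directory above
PORT = 8000            # arbitrary port for local testing

# Serve the spider output folder so downstream steps can fetch feeds over
# HTTP locally, roughly the way they would fetch from S3 in production.
handler = partial(SimpleHTTPRequestHandler, directory=OUTPUT_DIR)
HTTPServer(("localhost", PORT), handler).serve_forever()
```

For a quick one-off, `python -m http.server --directory output` does roughly the same thing.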

I added some initial documentation about how this works. At the moment, just looking for feedback/discussion.

will-snavely commented 3 years ago

This is ready for review now.

bonfirefan commented 3 years ago

Taking a look; merging in today with luck.

bonfirefan commented 3 years ago

This looks good @will-snavely - we just want to add documentation specifying that the output folder should be removed before each scrapy crawl run, or else we will get duplicate results.
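
A tiny hypothetical helper along those lines (the name and location are made up; this is not part of the PR):

```python
# clean_output.py -- hypothetical cleanup helper, assumed to live at the repo root
import shutil
from pathlib import Path

# The .gitignored "output" directory at the repository root.
output_dir = Path(__file__).resolve().parent / "output"

# Remove stale spider results so repeated `scrapy crawl` runs don't
# accumulate duplicate items in the combined feeds.
if output_dir.exists():
    shutil.rmtree(output_dir)
output_dir.mkdir()
```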