Closed: will-snavely closed this pull request 3 years ago.
This is ready for review now.
Taking a look, merging in today with luck.
This looks good @will-snavely - we just want to add documentation that specifies to remove the `Output` folder before each `scrapy crawl` run, or else we will get duplicates.
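Something along these lines in the docs could cover it. This is a sketch only: the `output/` folder location and the spider name are assumptions for illustration, and a plain `rm -rf output` before `scrapy crawl` would work just as well.

```python
# Hypothetical helper illustrating the documented workflow: clear the local
# output folder before a crawl so repeated runs don't accumulate duplicates.
import shutil
import subprocess
from pathlib import Path

OUTPUT_DIR = Path("output")  # assumed location: repository root


def clean_crawl(spider_name: str) -> None:
    # Drop results from any previous run, then start from an empty folder.
    shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    # Equivalent to running "scrapy crawl <spider_name>" from the repo root.
    subprocess.run(["scrapy", "crawl", spider_name], check=True)


if __name__ == "__main__":
    clean_crawl("example_spider")  # spider name is illustrative only
```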
I worked on this last night and wanted to get some feedback on the idea (see: https://github.com/pgh-public-meetings/city-scrapers-pitt/issues/183). Basically, I wanted to try and make it easier to integration test the full app locally, since this was useful for me to debug some recent production scraper issues. The basic changes are:

- A `dev` scrapy configuration that mirrors more of what happens in production (a settings sketch follows below).
- Getting `city_scrapers_core` to work on spider results stored in a local directory. The idea here is to allow basic integration testing w/o requiring an S3 bucket. This entails:
  - Writing spider output to an `output` folder in the repository root (which is `.gitignored`).
  - Adapting `combinefeeds` from the `city_scrapers_core` module to work with a local filesystem, instead of just S3/Azure (also sketched below).

I added some initial documentation about how this works. At the moment, just looking for feedback/discussion.
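For reference, here is a minimal sketch of what such a `dev` settings module could look like. The file path, module names, and feed layout are assumptions for illustration, not the actual contents of this PR; the key idea is pointing Scrapy's `FEEDS` setting at the git-ignored `output/` folder instead of an S3 bucket.

```python
# city_scrapers/settings/dev.py (hypothetical path): local-only settings that
# keep the usual spider configuration but write feed output to the
# git-ignored output/ folder instead of S3.
BOT_NAME = "city_scrapers"
SPIDER_MODULES = ["city_scrapers.spiders"]
ROBOTSTXT_OBEY = True

# One JSON feed per spider run, named by spider and timestamp, under output/.
FEEDS = {
    "output/%(name)s/%(time)s.json": {
        "format": "json",
    },
}
```

A `scrapy.cfg` entry or the `SCRAPY_SETTINGS_MODULE` environment variable could then select this module for local runs.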
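And here is a rough sketch of how a filesystem-backed combine step could work. The real `combinefeeds` command in `city_scrapers_core` targets S3/Azure storage, so the directory layout and the `latest.json` output name below are illustrative assumptions, not the PR's actual implementation.

```python
# Sketch of a local-filesystem combine step: merge the most recent feed file
# from each spider's output/<spider>/ directory into a single latest.json.
import json
from pathlib import Path


def combine_local_feeds(output_dir: str = "output") -> None:
    root = Path(output_dir)
    combined = []
    # Pick the newest feed file produced by each spider and merge them.
    for spider_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        feeds = sorted(spider_dir.glob("*.json"))
        if not feeds:
            continue
        with feeds[-1].open() as f:
            combined.extend(json.load(f))
    with (root / "latest.json").open("w") as f:
        json.dump(combined, f, indent=2)


if __name__ == "__main__":
    combine_local_feeds()
```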