bonfirefan closed this issue 4 years ago
I'll start on this one.
I'm going to start working on this.
Hey, I started trying to scrape the site using the Scrapy shell and ran into this error:
2019-09-07 14:38:05 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET http://www.ahs.dep.pa.gov/CalendarOfEvents/Default.aspx?list=true>
So the website is telling people it doesn't want to be scraped. Where do we want to go from here? Is there maybe another place that could be scraped with similar information that you know of?
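For anyone wondering what that DEBUG line means in practice, here is a rough sketch of a robots.txt check. It uses Python's standard-library parser as a stand-in (Scrapy's own middleware may use a different parser), and the `ROBOTS_TXT` body and the `is_allowed` helper are made up for illustration, since the site's actual robots.txt isn't quoted in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the real file served by the site is not shown here.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

URL = "http://www.ahs.dep.pa.gov/CalendarOfEvents/Default.aspx?list=true"


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = is_allowed(ROBOTS_TXT, URL)
print(allowed)
```

With a blanket `Disallow: /` rule like the assumed one, every path on the host is off limits to a crawler that obeys robots.txt, which is the kind of situation the Scrapy log message reports.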
Here's the verdict:
Ignore robots.txt files for government sites. The Wayback Machine has started ignoring them generally, and it's really just an internet courtesy that shouldn't apply to public information. We left ROBOTSTXT_OBEY as True in the Scrapy settings because it's the default, and that way we can more easily keep track of which sites are blocking scrapers by overriding it on a per-spider basis, like here: https://github.com/City-Bureau/city-scrapers/blob/c23f1fd17b1ced1c01dff295c8cdf1ca191f0244/city_scrapers/spiders/chi_ssa_34.py
So when you run into this, set ROBOTSTXT_OBEY to False for that spider.
I'm interested in giving this issue a shot and debugging it. Has anyone made progress, or is anyone still working on it?
Hey, I'm just about finished at this point. Just waiting for the code to be reviewed.
Closed by #80
Agency Name: PA Department of Environmental Protection
Spider Name: pa_energy
Website:
http://www.ahs.dep.pa.gov/CalendarOfEvents/Default.aspx?list=true
Scraping Notes:
A list of dates, plaintext