Spider: PA Department of Environmental Protection

pgh-public-meetings / city-scrapers-pitt

Pittsburgh City Scrapers: sourcing public meetings in Pittsburgh

https://pgh-public-meetings.github.io/events/

MIT License


Spider: PA Department of Environmental Protection #44

Closed: bonfirefan closed this issue 4 years ago

bonfirefan commented 5 years ago

Agency Name: PA Department of Environmental Protection

Spider Name: pa_energy

Website:

http://www.ahs.dep.pa.gov/CalendarOfEvents/Default.aspx?list=true

Scraping Notes:

A list of dates, plaintext

nickvasko commented 5 years ago

I'll start on this one.

henryCraig commented 4 years ago

I'm going to start working on this.

henryCraig commented 4 years ago

Hey, I started trying to scrape the site using the Scrapy shell and ran into this error:

2019-09-07 14:38:05 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET http://www.ahs.dep.pa.gov/CalendarOfEvents/Default.aspx?list=true>

So the site's robots.txt is telling crawlers it doesn't want this page scraped. Where do we want to go from here? Do you know of another source with similar information that could be scraped instead?
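For reference, the kind of check behind that "Forbidden by robots.txt" message can be reproduced with Python's standard library. This is only a sketch: the robots.txt contents below are an assumption (a blanket disallow), not the file the site actually serves, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: a blanket-disallow robots.txt, used purely for illustration.
# The real file served by www.ahs.dep.pa.gov may look different.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

URL = "http://www.ahs.dep.pa.gov/CalendarOfEvents/Default.aspx?list=true"


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ASSUMED_ROBOTS_TXT, URL))
```

With a blanket disallow like the assumed one, every path on the host is off limits to a crawler that honors robots.txt, which matches the behavior seen in the log line.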

bonfirefan commented 4 years ago

Here's the verdict:

Ignore robots.txt files for government sites. The Wayback Machine has started ignoring them generally, and robots.txt is really just an internet courtesy that shouldn't apply to public information. We left ROBOTSTXT_OBEY as True in the Scrapy settings because it's the default, and that way we can more easily keep track of which sites are blocking scrapers by overriding it on a per-spider basis, like here: https://github.com/City-Bureau/city-scrapers/blob/c23f1fd17b1ced1c01dff295c8cdf1ca191f0244/city_scrapers/spiders/chi_ssa_34.py

So when you run into this, set ROBOTSTXT_OBEY to False for that spider.
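A minimal sketch of a per-spider override, assuming a plain `scrapy.Spider` subclass. The class name and the `parse` body are hypothetical, and the project's real spiders may use a different base class:

```python
import scrapy


class PaEnergySpider(scrapy.Spider):
    """Illustrative skeleton only; the class name and parse body are assumptions."""

    name = "pa_energy"
    start_urls = [
        "http://www.ahs.dep.pa.gov/CalendarOfEvents/Default.aspx?list=true"
    ]

    # Per-spider override: the project-wide ROBOTSTXT_OBEY setting is left alone,
    # and only this spider skips the robots.txt check.
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def parse(self, response):
        # Parsing of the calendar page would go here (not covered in this thread).
        self.logger.info("Fetched %s", response.url)
```

Keeping the override in `custom_settings` means a search for `ROBOTSTXT_OBEY` across the spiders lists every site that blocks scrapers.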

ChrisJabb21 commented 4 years ago

I'm interested in giving this issue a shot and helping debug. Has anyone made progress, or is someone still working on it?

henryCraig commented 4 years ago

Hey, I'm just about finished at this point. Just waiting for the code to be reviewed.

ben-nathanson commented 4 years ago

Closed by #80