pgh-public-meetings / city-scrapers-pitt

Pittsburgh City Scrapers: sourcing public meetings in Pittsburgh
https://pgh-public-meetings.github.io/events/
MIT License

Spider: Pittsburgh Post-Gazette (Mixin Class) #182

Open maxachis opened 3 years ago

maxachis commented 3 years ago

Spider Name:

Pittsburgh Post-Gazette

Website:

https://classmart.post-gazette.com/pa/legal-notices/search

Scraping Notes:

Notes on how to put together a mixin class:

https://github.com/City-Bureau/city-scrapers/issues/439

bonfirefan commented 3 years ago

Patrik notes that the Post-Gazette has an anti-bot feature on its main domain, tinypass (run by Piano), so it is worth looking out for similar features on classmart. It is also worth looking for a classmart API, which would make scraping much easier.

bonfirefan commented 3 years ago

Issue #32 is related and redundant.

0x1F602 commented 3 years ago

I noticed this is run by a company named adperfect. https://post-gazette.adperfect.com/

0x1F602 commented 3 years ago

I'm cooking something up over here: https://github.com/0x1F602/city-scrapers-pitt/commit/7bfda2a43f4d23f7e12c53c2e9de081295dd668c

The data in the classified notices doesn't strictly fit a particular pattern, which makes it hard to grab start and end times, locations, and agencies. Hmm.

I think the new technique I've added here is a sort of "relevancy score" based on regular expressions, to attempt to separate obits and auctions from meeting notices.
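For anyone following along, a regex-based relevancy score could look roughly like the sketch below. This is not the code from the linked commit; the patterns, the weights, and the `relevancy_score` name are all assumptions made up for illustration.

```python
import re

# Hypothetical (pattern, weight) pairs, chosen only for illustration.
# Positive weights suggest a meeting notice; negative weights suggest
# an obituary/estate notice or an auction.
SIGNALS = [
    (re.compile(r"\b(public )?(meeting|hearing)\b", re.I), 2),
    (re.compile(r"\b(board|council|commission|authority)\b", re.I), 1),
    (re.compile(r"\b\d{1,2}:\d{2}\s*[ap]\.?m\.?", re.I), 1),
    (re.compile(r"\b(estate of|deceased|letters testamentary)\b", re.I), -3),
    (re.compile(r"\b(auction|sheriff'?s sale|bid(s|ding)?)\b", re.I), -2),
]


def relevancy_score(notice: str) -> int:
    """Sum the weights of every pattern that appears at least once."""
    return sum(weight for pattern, weight in SIGNALS if pattern.search(notice))


if __name__ == "__main__":
    samples = [
        "The Board of Directors will hold a public meeting on May 4 at 6:30 p.m.",
        "Estate of John Doe, deceased. Letters testamentary have been granted.",
        "Public auction of storage units; bidding starts at 10:00 a.m.",
    ]
    for text in samples:
        print(relevancy_score(text), text)
```

Notices scoring above some threshold (say, zero) would be kept as likely meeting notices. Each pattern counts once per notice regardless of how many times it matches, which keeps long notices from dominating.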

0x1F602 commented 3 years ago

https://github.com/0x1F602/city-scrapers-pitt/commit/1a362682663442ca0f30fc54b55c6427299ea774

I'm experimenting with a named entity recognition library called spaCy. So far so good. https://spacy.io/

This is allowing me to at least grab the date and time to combine them into a start datetime. If we take it a lot further, it may be possible to automatically grab addresses and events.
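Roughly, the date/time step can be sketched as follows. This is not the code from the commit above: `combine_start`, `extract_start`, and the format lists are hypothetical names and assumptions for illustration. The `nlp` argument is assumed to be a loaded spaCy pipeline whose entity recognizer emits `DATE` and `TIME` labels (the English models such as `en_core_web_sm` do), and the parsing only handles a few common US-style formats.

```python
from datetime import datetime

# Assumed formats; real notices will need more (or a fuzzy date parser).
DATE_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%m/%d/%Y")
TIME_FORMATS = ("%I:%M %p", "%I %p", "%H:%M")


def _parse(text, formats):
    """Try each format in turn; return a datetime or None."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def combine_start(date_text, time_text):
    """Combine separate date and time strings into one start datetime."""
    # Normalize "6:30 p.m." to "6:30 PM" so strptime's %p can read it.
    cleaned_time = time_text.replace(".", "").upper()
    date_part = _parse(date_text, DATE_FORMATS)
    time_part = _parse(cleaned_time, TIME_FORMATS)
    if date_part is None or time_part is None:
        return None
    return datetime.combine(date_part.date(), time_part.time())


def extract_start(text, nlp):
    """Take the first DATE and first TIME entity spaCy finds in the notice."""
    doc = nlp(text)
    date_text = next((e.text for e in doc.ents if e.label_ == "DATE"), None)
    time_text = next((e.text for e in doc.ents if e.label_ == "TIME"), None)
    if date_text is None or time_text is None:
        return None
    return combine_start(date_text, time_text)
```

To try it against a real model, something like `extract_start(notice_text, spacy.load("en_core_web_sm"))` should work once spaCy and that model are installed. What the recognizer actually tags will vary by notice, so `None` results are to be expected and should be handled by the caller.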