Open maxachis opened 3 years ago
Patrik notes that the Post-Gazette has an anti-bot feature on its main domain (Tinypass, run by Piano), so it is worth looking out for similar features on classmart. It is also worth checking whether classmart has an API, which would make scraping much easier.
Related and redundant issue is #32
I noticed the classifieds site is run by a company named AdPerfect: https://post-gazette.adperfect.com/
I'm cooking something up over here: https://github.com/0x1F602/city-scrapers-pitt/commit/7bfda2a43f4d23f7e12c53c2e9de081295dd668c
The data in the classified notices don't strictly fit a particular pattern, which makes it hard to grab start and end times and locations and agencies. Hmm.
The new piece I've added here is a sort of "relevancy score" based on regular expressions, which attempts to separate obits, auctions, and meeting notices.
https://github.com/0x1F602/city-scrapers-pitt/commit/1a362682663442ca0f30fc54b55c6427299ea774
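For anyone following along, here is a minimal sketch of what a regex-based relevancy score could look like. It is not the code from the commit above: the category names, patterns, weights, and the `threshold` parameter are all assumptions made up for illustration.

```python
import re

# Hypothetical patterns and weights, chosen only to illustrate the idea.
# Each category maps to a list of (compiled regex, weight) pairs.
CATEGORY_PATTERNS = {
    "meeting": [
        (re.compile(r"\bmeeting\b", re.I), 2),
        (re.compile(r"\b(board|council|commission|authority)\b", re.I), 1),
        (re.compile(r"\b\d{1,2}:\d{2}\s*[ap]\.?m\.?", re.I), 1),
    ],
    "obituary": [
        (re.compile(r"\b(passed away|survived by|obituary)\b", re.I), 2),
    ],
    "auction": [
        (re.compile(r"\b(auction|sheriff'?s sale|bids?|bidding)\b", re.I), 2),
    ],
}


def relevancy_scores(text):
    """Return a score per category: the sum of weights of patterns found in text."""
    return {
        category: sum(weight for pattern, weight in patterns if pattern.search(text))
        for category, patterns in CATEGORY_PATTERNS.items()
    }


def classify(text, threshold=2):
    """Return the highest-scoring category, or None if nothing reaches threshold."""
    scores = relevancy_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

Usage would be something like `classify(notice_text) == "meeting"` to decide whether a notice is worth parsing further. Real notices will need far more patterns than this, and the weights would have to be tuned against actual classmart listings.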
I'm experimenting with a named entity recognition library called spaCy. So far so good. https://spacy.io/
This lets me at least grab the date and the time and combine them into a start datetime. If we take it a lot further, it may be possible to automatically grab addresses and events as well.
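A rough sketch of the date/time step, in case it helps others. The assumptions are mine: the `en_core_web_sm` model, the helper names, and the short lists of date and time formats are not taken from the linked commits. spaCy's entity spans are often messier than these formats (for example "Thursday, June 3"), in which case this simply returns `None`.

```python
from datetime import datetime

# Assumed formats; real notices will need more of these.
DATE_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%m/%d/%Y")
TIME_FORMATS = ("%I:%M %p", "%I %p", "%H:%M")


def _parse_first(text, formats):
    """Try each format in turn and return the first successful parse, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def combine_start(date_text, time_text):
    """Combine a date string and a time string into one datetime, or None."""
    # Normalize "6:30 p.m." to "6:30 PM" so that %p can match it.
    cleaned_time = time_text.replace(".", "").upper()
    parsed_date = _parse_first(date_text, DATE_FORMATS)
    parsed_time = _parse_first(cleaned_time, TIME_FORMATS)
    if parsed_date is None or parsed_time is None:
        return None
    return datetime.combine(parsed_date.date(), parsed_time.time())


def extract_start(text, nlp=None):
    """Use spaCy DATE and TIME entities to build a start datetime, or None."""
    if nlp is None:
        # Imported lazily so combine_start works without spaCy installed.
        import spacy
        nlp = spacy.load("en_core_web_sm")  # assumed model; must be downloaded first
    doc = nlp(text)
    dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
    times = [ent.text for ent in doc.ents if ent.label_ == "TIME"]
    if not dates or not times:
        return None
    # Naively pair the first date with the first time found in the notice.
    return combine_start(dates[0], times[0])
```

Pairing the first DATE with the first TIME is the simplest thing that could work; notices that mention a publication date before the meeting date would need smarter selection.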
Spider Name:
Pittsburgh Post-Gazette
Website:
https://classmart.post-gazette.com/pa/legal-notices/search
Scraping Notes:
Notes on how to put together a mixin class:
https://github.com/City-Bureau/city-scrapers/issues/439