pgh-public-meetings / city-scrapers-pitt

Pittsburgh City Scrapers: sourcing public meetings in Pittsburgh
https://pgh-public-meetings.github.io/events/
MIT License

0008 spider pitt zoning #34

Closed danwarren closed 5 years ago

danwarren commented 5 years ago

NEW DEPENDENCY: PyPDF2

This is a sometimes-working, but not perfect, PDF spider for #8.

bonfirefan commented 5 years ago

Thanks for tackling the hardest one out there, Dan! It looks like you're very close; you will need to add a PDF test file alongside your HTML file in your test files. Out of curiosity, what was the rationale for using PyPDF2 over the other Python PDF libraries?

danwarren commented 5 years ago

I guess I kinda stole your "hard mode" example, sorry about that :)

Re: PyPDF2: I googled example code for the libraries with Python 3 support, and the interface of PyPDF2 just looked more readable and Pythonic to me. On this site the PDFs have one hearing per page, so being able to handle pages intuitively was a high priority.

I did confine the PDF processing to a single function. If the text decoder (or the method used to build the PDFs) changes, the regexes are likely to fail, so the decoder and the regexes could be switched out together if a different library is selected as the official one.
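A minimal sketch of that separation, not the code in this pull request. The names `parse_hearings`, `fake_extract_pages`, and `HEARING_DATE_RE`, and the "Date of Hearing:" layout, are assumptions made for illustration. The page extractor is passed in as a function, so a PDF library and the regexes tuned to its output can be replaced as a pair.

```python
import re
from typing import Callable, List

# Hypothetical pattern: the real one depends on how the chosen
# library decodes the text of each page.
HEARING_DATE_RE = re.compile(r"Date of Hearing:\s*(?P<date>.+)")


def parse_hearings(
    pdf_bytes: bytes,
    extract_pages: Callable[[bytes], List[str]],
) -> List[dict]:
    """Return one dict per page on which a hearing date is found."""
    hearings = []
    # One hearing per page, so pages are handled independently.
    for page_number, text in enumerate(extract_pages(pdf_bytes), start=1):
        match = HEARING_DATE_RE.search(text)
        if match:
            hearings.append(
                {"page": page_number, "date": match.group("date").strip()}
            )
    return hearings


def fake_extract_pages(pdf_bytes: bytes) -> List[str]:
    """Stand-in extractor for tests: treats form feeds as page breaks.

    A real extractor would wrap the PDF library here instead.
    """
    return pdf_bytes.decode("utf-8").split("\f")
```

With this shape, a test can feed plain text through `fake_extract_pages` and exercise the regexes without a PDF file, while the spider passes an extractor backed by whichever PDF library is chosen.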

danwarren commented 5 years ago

Now that I've figured out the Pipfile format and it's actually building :), it looks like regex failures break the test... so I can put in fallback regexes, and then fall back to known defaults if nothing matches.

Do we have a standard way of differentiating defaults from scraped data in the message output?

In the case of zoning, the address (for instance) appears to be consistent, but I would not want to enter the default if the person searching could not distinguish a scraped address from a guess.
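One possible shape for this, sketched under assumptions rather than taken from the project: try each regex in order, and return the value together with a flag saying whether it was scraped or defaulted. The names `Field`, `extract_with_fallback`, `ADDRESS_PATTERNS`, and `DEFAULT_ADDRESS`, the patterns, and the placeholder address are all hypothetical.

```python
import re
from typing import List, NamedTuple


class Field(NamedTuple):
    value: str
    scraped: bool  # True if matched in the text, False if it is the default


def extract_with_fallback(
    text: str, patterns: List[re.Pattern], default: str
) -> Field:
    """Try each pattern in order; use the default only if none match."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return Field(match.group(1).strip(), True)
    return Field(default, False)


# Hypothetical patterns, most specific first.
ADDRESS_PATTERNS = [
    re.compile(r"Location:\s*(.+)"),
    re.compile(r"held at\s+(.+)"),
]
# Placeholder only, not the real hearing location.
DEFAULT_ADDRESS = "123 Example Street"
```

A spider could then copy `field.value` into the address and surface `field.scraped` however the project decides to expose it, so a defaulted address is never presented as a scraped one.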

danwarren commented 5 years ago

Please be aware that I DISABLED scrapy verify in the .travis.yml file, since leaving it enabled would totally break our ability to use Travis CI if this were merged back to master.

bonfirefan commented 5 years ago

Hey Dan, I merged the new update into your branch and I think it works, but I don't have permission to push to your branch. If you can merge in the changes from upstream/master, I think we'll finally be able to merge this pull request!