Closed danwarren closed 5 years ago
Thanks for tackling the hardest one out there, Dan! It looks like you're very close; you'll just need to add a PDF test file alongside your HTML file in your test files. Out of curiosity, what was the rationale for using PyPDF2 over the other Python PDF libraries?
I guess I kinda stole your "hard mode" example, sorry about that :)
Re: PyPDF2, I googled example code for the libraries with Python 3 support, and the interface of PyPDF2 just looked more readable and Pythonic to me. In the case of this site, the PDFs have one hearing per page, so being able to handle pages intuitively was a high priority.
I did confine the PDF processing to a single function, but if the text decoder (or the PDF build method) changes, the regexes are likely to fail, so those two things should be switched out together if a different library is selected as the official one.
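For anyone following along, here is a rough sketch of that kind of isolation, not the code in this PR. It assumes the `PdfReader` / `extract_text()` names from PyPDF2 2.0 and later (older releases spelled these `PdfFileReader` / `extractText()`), and the function names and the date pattern are made up for illustration.

```python
import re
from typing import Iterable, List, Optional

# Hypothetical pattern; the real one depends on how the library decodes text.
DATE_RE = re.compile(r"(\d{1,2}/\d{1,2}/\d{4})")


def extract_page_texts(path: str) -> List[str]:
    """The only function that touches the PDF library."""
    # Imported lazily so the parsing code below runs without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0 naming

    reader = PdfReader(path)
    # One string per page, since each page holds one hearing.
    return [page.extract_text() or "" for page in reader.pages]


def parse_hearing_dates(page_texts: Iterable[str]) -> List[Optional[str]]:
    """Return the first date found on each page, or None if nothing matches."""
    dates = []
    for text in page_texts:
        match = DATE_RE.search(text)
        dates.append(match.group(1) if match else None)
    return dates
```

With this split, swapping libraries means replacing `extract_page_texts` and re-checking the patterns against the new decoder's output, while `parse_hearing_dates` can be tested on plain strings.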
Now that I've figured out the Pipfile format and it's actually building :), it looks like regex failures break the test... so I can put in fallback regexes, and then put in known defaults if there are no matches.
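A minimal sketch of the try-the-next-pattern idea, using only the standard library. The patterns, the helper names, and the `9:00 AM` default are all placeholders, not values taken from the site or from this PR.

```python
import re
from typing import Optional, Sequence

# Hypothetical patterns, ordered from strictest to loosest.
TIME_PATTERNS = (
    re.compile(r"\b(\d{1,2}:\d{2}\s*[AP]\.?M\.?)", re.IGNORECASE),
    re.compile(r"\b(\d{1,2}\s*[AP]\.?M\.?)", re.IGNORECASE),
)


def first_match(patterns: Sequence["re.Pattern[str]"], text: str) -> Optional[str]:
    """Try each pattern in order and return the first captured group."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


def hearing_time(text: str, default: str = "9:00 AM") -> str:
    """Fall back to a known default (placeholder value) if nothing matches."""
    return first_match(TIME_PATTERNS, text) or default
```

Because `first_match` returns `None` instead of raising, a page that matches nothing no longer breaks the run; the caller decides whether a default is acceptable.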
Do we have a standard way of differentiating defaults from scraped data in the message output?
In the case of zoning, the address (for instance) appears to be consistent, but I would not want to enter the default if the searcher could not distinguish a scraped address from a guess.
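I don't know of an existing convention for this, so the following is only one possible approach: carry a source tag next to each value so that a default never looks like scraped data. The `Field` type, the `address_source` key, the pattern, and the default address are all hypothetical.

```python
import re
from typing import Dict, NamedTuple


class Field(NamedTuple):
    value: str
    source: str  # "scraped" or "default"


# Hypothetical pattern and placeholder default, for illustration only.
ADDRESS_RE = re.compile(r"Location:\s*(.+)")
DEFAULT_ADDRESS = "City Hall (placeholder default)"


def scrape_address(text: str) -> Field:
    """Return the address together with a tag saying where it came from."""
    match = ADDRESS_RE.search(text)
    if match:
        return Field(match.group(1).strip(), "scraped")
    return Field(DEFAULT_ADDRESS, "default")


def to_output(field: Field) -> Dict[str, str]:
    """Emit the value and its provenance as separate keys."""
    return {"address": field.value, "address_source": field.source}
```

A searcher could then filter on `address_source` and treat anything marked `default` as a guess.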
Please be aware that I DISABLED scrapy verify in the .travis.yml file, since it would totally break our ability to use Travis CI if this were merged back to master with that enabled.
Hey Dan, I merged the new update into your branch and I think it works, but I don't have permission to push to your branch. If you can merge in the changes from upstream/master, I think we'll finally be able to merge this pull request!
NEW DEPENDENCY: PyPDF2
This is a sometimes-working, but not perfect, PDF spider for #8.