Fixes catchup

divergentdave commented 9 years ago

This fixes a variety of scraper issues that have been building up, including 4 partial rewrites for new sites, several missing dates, and support for .docx files.

Since this adds a dependency on the python-docx module, deploying will require running pip install -r requirements.txt again.

konklone commented 9 years ago

I reviewed each changed inspector, and all the changes are positive and work fantastically. Thanks, @divergentdave! I added a few minor commits that rewrite report/landing page URLs to be HTTPS for a few IGs that have since migrated. In a few cases, we were using HTTPS for their hardcoded URLs, but their HTTPS page was still linking to the HTTP versions of reports and landing pages.
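For illustration only, that kind of HTTP-to-HTTPS rewrite could be sketched as below. This is not the code from those commits: the `force_https` helper, the `HTTPS_HOSTS` set, and the example hostname are all hypothetical, and the sketch assumes you keep a list of hosts known to have migrated.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical set of hosts known to serve HTTPS (not taken from the repository).
HTTPS_HOSTS = {"oig.example.gov"}


def force_https(url, https_hosts=HTTPS_HOSTS):
    """Upgrade an http:// URL to https:// when its host is in https_hosts."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.hostname in https_hosts:
        # SplitResult is a namedtuple, so _replace returns a modified copy.
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)
```

Applying it to each scraped report and landing page URL leaves unknown hosts untouched, so a site that has not migrated keeps its working HTTP links.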

Also, I'm having trouble causing the hhs scraper to download anything to disk and create data in the data/ directory -- but I'm nearly positive that has nothing to do with this PR, so just flagging it before I merge.

divergentdave commented 9 years ago

The HHS scraper is working for me. Are you letting it run to completion? That particular scraper is two-pass, and the first pass takes forever, before it gets to saving reports.
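To show why nothing appears in data/ until late in the run, here is a minimal sketch of a two-pass layout. It is not the HHS scraper's actual code: `run_two_pass`, `fetch_listing`, `save_report`, and the `report_id` key are hypothetical names, and merging duplicates by ID is only an assumed reason for collecting everything first.

```python
def run_two_pass(listing_pages, fetch_listing, save_report):
    """Collect all report metadata first, then save; returns the number saved."""
    # Pass 1: walk every listing page and hold the results in memory.
    # Nothing is written to disk during this (slow) phase.
    reports = {}
    for page in listing_pages:
        for report in fetch_listing(page):
            # Assumed behavior: later duplicates overwrite earlier ones by ID.
            reports[report["report_id"]] = report

    # Pass 2: only now are reports actually saved.
    for report in reports.values():
        save_report(report)
    return len(reports)
```

With a layout like this, interrupting the run during the first pass leaves no output at all, which matches the symptom described above.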

konklone commented 9 years ago

Ohhhhh, yeah, I'm sure that's it. No, I wasn't letting it go to completion, which explains it.

unitedstates / inspectors-general

Fixes catchup #254