unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Remove BeautifulSoup warning, force it to use lxml #252

Closed harrisj closed 8 years ago

harrisj commented 8 years ago

I have explicitly set the code to use lxml as the HTML parser for BeautifulSoup. In addition, I have refactored most of the inspectors to replace the call to utils.download followed by a BeautifulSoup(body) call with single utils.beautifulsoup_from_url lines. This reduced the number of places in the code where BeautifulSoup was instantiated, but there are some exceptions, usually because the scraper is using a POST request instead. In the case of HHS, they do a bit more with the response as well.
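For readers who have not opened the diff, the shape of the refactor is roughly the following. This is a minimal sketch and not the repository's actual code: only the names utils.download and utils.beautifulsoup_from_url come from this PR, while the `fetch` and `parser` parameters, the `urllib`-based download, and the `None` handling are assumptions made so the example stands alone. BeautifulSoup warns when it has to guess a parser, so naming one explicitly is what removes the warning.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def download(url):
    """Hypothetical stand-in for utils.download: return the page body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def beautifulsoup_from_url(url, fetch=download, parser="lxml"):
    """Fetch a page and parse it with an explicitly named parser.

    `fetch` and `parser` are illustrative parameters (assumptions), added so
    the helper can be exercised without network access.
    """
    body = fetch(url)
    if body is None:
        # Assumed behaviour: pass a failed download through to the caller.
        return None
    # Naming the parser avoids BeautifulSoup's "no parser was explicitly
    # specified" warning and keeps parsing consistent across machines.
    return BeautifulSoup(body, parser)
```

With a helper like this, a scraper that previously called the download function and then built the soup itself needs only one line, for example `doc = beautifulsoup_from_url(url)`. Scrapers that submit a POST request would still construct `BeautifulSoup(body, "lxml")` directly.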

I have noticed a few scraping errors when testing this change out in the following scrapers:

Many of these look to be issues with changes in the page structure/URLs of the IG sites, but until I can investigate, this PR should be considered a WORK IN PROGRESS and probably not be merged quite yet. Thank you.

harrisj commented 8 years ago

Okay, I think all of the remaining issues I am seeing are scraper problems (sites changing page structure, new reports without dates) and not the result of these changes. I think this is ready for your review whenever you are. Thank you.

divergentdave commented 8 years ago

Looks good to me! :+1:

konklone commented 8 years ago

What an excellent and helpful contribution, @harrisj! Like pruning a bonsai tree.

And @divergentdave, thanks for :eyes:-ing it. You can feel free to merge PRs like this too after reviewing them, I don't mean to be the bottleneck for stuff like this.

harrisj commented 8 years ago

Thank you and sorry for the typo