Remove BeautifulSoup warning, force it to use lxml

harrisj commented 9 years ago

I have explicitly set the code to use lxml as the HTML parser for BeautifulSoup. In addition, I have refactored most of the inspectors to replace the call to utils.download followed by a BeautifulSoup(body) call with a single utils.beautifulsoup_from_url line. This reduces the number of places in the code where BeautifulSoup is instantiated, but there are some exceptions, usually because the scraper is using a POST request instead. In the case of HHS, the scraper does a bit more with the response as well.
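For illustration, here is a minimal sketch of the download-then-parse pattern described above. It is not the code from this PR: the real signature of utils.beautifulsoup_from_url is not shown here, so the `download` parameter and the `fake_download` stand-in below are assumptions made to keep the example self-contained and runnable without network access.

```python
from bs4 import BeautifulSoup


def beautifulsoup_from_url(url, download):
    """Fetch `url` with the supplied `download` callable and parse the body.

    `download` is a hypothetical stand-in for the project's own download
    helper; it is assumed to return the page body, or None on failure.
    """
    body = download(url)
    if body is None:
        return None
    # Naming the parser explicitly ("lxml") keeps BeautifulSoup from
    # guessing one and emitting its "no parser specified" warning.
    return BeautifulSoup(body, "lxml")


def fake_download(url):
    # Hypothetical downloader that returns canned HTML instead of
    # making a real HTTP request.
    return "<html><body><a href='/report.pdf'>Report</a></body></html>"


doc = beautifulsoup_from_url("https://example.com/reports", fake_download)
link_href = doc.find("a")["href"]
```

With a helper like this, each scraper needs one call rather than a download followed by its own BeautifulSoup instantiation, so the parser choice lives in a single place.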

I have noticed a few scraping errors when testing this change out in the following scrapers:

CFTC
DOT
EPA
FCA
HHS
NCUA
SEC
Treasury
USPS

Many of these look to be issues with changes in the page structure/URLs of the IG sites, but until I can investigate, this PR should be considered a WORK IN PROGRESS and probably not be merged quite yet. Thank you.

harrisj commented 9 years ago

Okay, I think all of the remaining issues I am seeing are scraper problems (sites changing page structure, new reports without dates) and not the result of these changes. I think this is ready for your review whenever you are. Thank you.

divergentdave commented 9 years ago

Looks good to me! :+1:

konklone commented 9 years ago

What an excellent and helpful contribution, @harrisj! Like pruning a bonsai tree.

And @divergentdave, thanks for :eyes:-ing it. Feel free to merge PRs like this too after reviewing them; I don't mean to be the bottleneck for stuff like this.

harrisj commented 9 years ago

Thank you and sorry for the typo

unitedstates / inspectors-general

Remove BeautifulSoup warning, force it to use lxml #252