Closed harrisj closed 9 years ago
Okay, I think all of the remaining issues I am seeing are scraper problems (sites changing page structure, new reports without dates) and not the result of these changes. I think it's clear for your review if you are ready. Thank you.
Looks good to me! :+1:
What an excellent and helpful contribution, @harrisj! Like pruning a bonsai tree.
And @divergentdave, thanks for :eyes:-ing it. You can feel free to merge PRs like this too after reviewing them, I don't mean to be the bottleneck for stuff like this.
Thank you and sorry for the typo
I have explicitly set the code to use `lxml` as the HTML processor for BeautifulSoup. In addition, I have refactored most of the inspectors to replace the call to `utils.download` followed by a `BeautifulSoup(body)` call with a single `utils.beautifulsoup_from_url` line. This reduces the number of places in the code where BeautifulSoup is instantiated, but there are some exceptions, usually because the scraper is using a POST request instead. In the case of HHS, the scraper does a bit more with the response as well.

I have noticed a few scraping errors when testing this change out in the following scrapers:
Many of these look to be issues with changes in the page structure/URLs of the IG sites, but until I can investigate, this PR should be considered a WORK IN PROGRESS and probably not be merged quite yet. Thank you.
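
For reviewers, here is a minimal sketch of the pattern described above, not the actual code in `utils`. The `download` function below is a hypothetical stand-in for `utils.download`, and the `fetch` and `parser` parameters are assumptions added only so the example can run without network access; the real helper's signature may differ.

```python
import urllib.request

from bs4 import BeautifulSoup


def download(url):
    """Hypothetical stand-in for utils.download: return the page body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def beautifulsoup_from_url(url, fetch=download, parser="lxml"):
    """Fetch a URL and parse it in one step, naming the parser explicitly.

    Returns None when the fetch yields no body, so callers can skip the page.
    """
    body = fetch(url)
    if body is None:
        return None
    # Passing the parser name avoids BeautifulSoup picking whichever
    # parser happens to be installed on the machine.
    return BeautifulSoup(body, parser)
```

With a helper like this, an inspector that previously fetched the body and then built the soup itself would only need one call, e.g. `doc = beautifulsoup_from_url(url)`. Scrapers that use POST requests would still build the soup by hand.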