The scraper classes all have retry logic built in, since the classes derive from the scrapelib Scraper class.
All of the Sentry errors seem to be related to a place in the code where we don't use the scraper's request method, but use the requests library directly.
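For reference, a minimal sketch of the difference between the two call patterns (the gateway_url value and matter id here are placeholders, not the real ones in the scraper):

import requests
import scrapelib

gateway_url = 'https://example.legistar.com/gateway.aspx?M=F&ID={}'  # placeholder URL

# current pattern: plain requests call, no retry logic on connection errors
location = requests.head(gateway_url.format(1234)).headers['Location']

# scrapelib pattern: head() on a Scraper goes through scrapelib's request(),
# which retries failed requests up to retry_attempts times
scraper = scrapelib.Scraper(retry_attempts=3)
location = scraper.head(gateway_url.format(1234)).headers['Location']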
So, why did we do this instead of self.head?
If I recall correctly, this had to do with self.head acting kind of weirdly. I think this has to do with the caching portions of scrapelib.
I would recommend we try something like this:
from scrapelib import RetrySession
...
retry_session = RetrySession()
legislation_detail_route = retry_session.head(
gateway_url.format(matter_id)).headers['Location']
@fgregg i'd like to understand more about why self.head didn't work. this docstring seems like it could be instructive:
if we never want to cache certain pages, it looks like scrapelib provides a method to determine whether responses should be cached (https://github.com/jamesturk/scrapelib/blob/dcae9fa86f1fdcc4b4e90dbca12c8063bcb36525/scrapelib/cache.py#L35).
would it be appropriate to override that method so responses from the gateway_url are not cached? i could also see a more generalizable pattern like:
class LegistarScraper(LegistarSession, scrapelib.Scraper):
    NEVER_CACHE = []

    def should_cache_response(self, response):
        if any(response.url.startswith(url) for url in self.NEVER_CACHE):
            return False
        return super().should_cache_response(response)
where NEVER_CACHE is set on the typed scrapers, e.g., the events url on the events scraper, the gateway url on the bills scraper...
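for instance, it could look something like this (the class names and urls here are just illustrative, not the actual ones in this repo):

class LegistarBillScraper(LegistarScraper):
    # hypothetical: never cache redirects from the gateway endpoint
    NEVER_CACHE = ['https://chicago.legistar.com/gateway.aspx']


class LegistarEventsScraper(LegistarScraper):
    # hypothetical: never cache the events listing page
    NEVER_CACHE = ['https://chicago.legistar.com/Calendar.aspx']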
sounds like a good approach. Definitely worth investigating why we didn't like self.head.
i've done some testing and logging locally, and it looks like the approach will work. however, it also seems that responses to head requests are not cached... so the approach won't help us here. :-)
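roughly the kind of check i mean (a sketch only — the example url is a placeholder, and MemoryCache / should_cache_response are the scrapelib hooks referenced above):

import logging
import scrapelib
from scrapelib.cache import MemoryCache

logging.basicConfig(level=logging.DEBUG)

class LoggingScraper(scrapelib.Scraper):
    def should_cache_response(self, response):
        # log every response that scrapelib considers for caching
        logging.debug('considering %s %s', response.request.method, response.url)
        return super().should_cache_response(response)

scraper = LoggingScraper()
scraper.cache_storage = MemoryCache()
scraper.cache_write_only = False

scraper.head('https://example.com')  # hypothesis: no log line, head responses don't get cached
scraper.get('https://example.com')   # logs, and the 200 response is cached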
going back to why we didn't like self.head, that line got changed from self to requests here: https://github.com/opencivicdata/python-legistar-scraper/commit/73ee7f2bdb142fe213ff87eef30e355414e45f60
looks like you were making some changes to skip retrying when the votes page didn't exist. i can't immediately tell how the head change is related... i know it's been a while, but does this commit, or the surrounding commits, jog your memory?
it shouldn't be related. sorry, I can't recollect more.