opencivicdata / python-legistar-scraper

Scrapes municipal data from Legistar websites
BSD 3-Clause "New" or "Revised" License

retry requests that raise 104 errors #83

Closed hancush closed 5 years ago

hancush commented 5 years ago

for metro, we've had a super low success rate for full bill scrapes (like, one success a week) since we increased the volume of requests by scraping both public and private bills. the 104 (connection reset by peer) errors from legistar seem to be self-resolving, i.e., we may see a lower rate of scraper failure if we incorporate retries when a 104 occurs.

fgregg commented 5 years ago

The scraper classes all have retry logic built in, as they derive from the scrapelib Scraper class.
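
For context, a minimal sketch of the built-in retry knobs (retry_attempts and retry_wait_seconds are scrapelib constructor parameters; the metro URL is just illustrative):

import scrapelib

# any request made through the scraper itself is retried on connection
# errors such as 104 (connection reset by peer); the exact retry
# conditions and backoff are scrapelib's
scraper = scrapelib.Scraper(retry_attempts=3, retry_wait_seconds=10)
response = scraper.get('https://metro.legistar.com/Calendar.aspx')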

All the sentry errors seem to be related to a place in the code where we don't use the scraper's request method, but use the requests library directly.

https://github.com/opencivicdata/python-legistar-scraper/blob/0f8a66ae7ad270dcfd48cee632335dde678aa11e/legistar/bills.py#L387-L388

So, why did we do this instead of self.head?

If I recall correctly, this had to do with self.head acting kind of weirdly. I think that has to do with the caching portions of scrapelib.

I would recommend we try something like this.

from scrapelib import RetrySession
...
retry_session = RetrySession()
legislation_detail_route = retry_session.head(
    gateway_url.format(matter_id)
).headers['Location']

hancush commented 5 years ago

@fgregg i'd like to understand more about why self.head didn't work. this docstring seems like it could be instructive:

https://github.com/opencivicdata/python-legistar-scraper/blob/912a5ac244f7c1ed593eeb4578e9f24a46dee7a2/legistar/events.py#L14-L17

if we never want to cache certain pages, it looks like scrapelib provides a method, should_cache_response, to determine whether responses should be cached: https://github.com/jamesturk/scrapelib/blob/dcae9fa86f1fdcc4b4e90dbca12c8063bcb36525/scrapelib/cache.py#L35

would it be appropriate to override that method so responses from the gateway_url are not cached? i could also see a more generalizable pattern like:

import scrapelib

class LegistarScraper(LegistarSession, scrapelib.Scraper):

    # urls whose responses should never be written to the cache
    NEVER_CACHE = []

    def should_cache_response(self, response):
        if any(response.url.startswith(url) for url in self.NEVER_CACHE):
            return False

        return super().should_cache_response(response)

where NEVER_CACHE is set on the typed scrapers, e.g., the events url on the events scraper, the gateway url on the bills scraper...
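
for instance, a hypothetical bills scraper (class name and url are illustrative; the real value would be whatever bills.py formats into gateway_url):

class LegistarBillScraper(LegistarScraper):

    # never cache responses from the gateway redirect, so head requests
    # always hit legistar and get a fresh Location header
    NEVER_CACHE = ['https://metro.legistar.com/gateway.aspx']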

fgregg commented 5 years ago

Sounds like a good approach. Definitely worth investigating why we didn't like self.head.


hancush commented 5 years ago

i've done some testing and logging locally, and it looks like the override works. however, it also seems that responses to head requests are never cached... so the approach won't help us here. :-)
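
for reference, this seems to be because scrapelib's CachingSession only assigns cache keys to GET requests. roughly (paraphrased, not the exact source):

import requests

class CachingSession(requests.Session):
    # paraphrased from scrapelib's cache.py; the exact source may differ.
    # non-GET requests never get a cache key, so HEAD responses bypass
    # the cache and should_cache_response is never consulted for them.
    def key_for_request(self, method, url, **kwargs):
        if method.lower() != 'get':
            return None
        return requests.Request(url=url, params=kwargs.get('params', {})).prepare().url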

going back to why we didn't like self.head, that line got changed from self to requests here: https://github.com/opencivicdata/python-legistar-scraper/commit/73ee7f2bdb142fe213ff87eef30e355414e45f60

looks like you were making some changes to skip retrying when the votes page didn't exist. i can't immediately tell how the head change is related... i know it's been a while, but does this commit, or the surrounding commits, jog your memory?

fgregg commented 5 years ago

it shouldn't be related. sorry, I can't recollect more.