opencivicdata / python-legistar-scraper

Scrapes municipal data from Legistar websites
BSD 3-Clause "New" or "Revised" License

The API improves, may be able to get away with not scraping web #92

Closed fgregg closed 4 years ago

fgregg commented 5 years ago

The Legistar Web API has some new features that may obviate the need for so much web scraping.

So far, I've noticed that the event endpoints have a comment field and a link to the event's web URL.

skorasaurus commented 5 years ago

Have you found a changelog anywhere (or does Granicus never even release one)?

fgregg commented 5 years ago

I haven't.

fgregg commented 5 years ago

It appears that some resources are still only available from the InSite web pages; for example, Audio Links in LA Metro.

However, the API entries always seem to include the InSite URL. So we could get rid of _scrapeWebCalendar and instead do something like this in events:

def events(...):
    ...
    for api_event in self.api_events(since_datetime):
        ...
        web_event = self.web_scraper.event_detail(api_event['EventInSiteURL'])

That is, visit the InSite detail page for an event directly, using the URL the API gives us, and only when we need to.

This should give us more confidence than the approach of trying to connect events scraped in two ways. It should also make for faster scrapes, as we'll visit fewer pages overall.
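
To make the idea concrete, here is a rough, self-contained sketch of the "API first, web page only when needed" loop. Only the EventInSiteURL key comes from the snippet above; the function name events_with_web_detail, the needs_web_detail hook, the EventId and Audio keys, the example URL, and the fake fetcher are all hypothetical stand-ins, not the library's actual code.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Tuple

Event = Dict[str, Any]


def events_with_web_detail(
    api_events: Iterable[Event],
    fetch_detail: Callable[[str], Event],
    needs_web_detail: Callable[[Event], bool] = lambda event: True,
) -> Iterator[Tuple[Event, Optional[Event]]]:
    """Pair each API event with its InSite detail page, fetched on demand."""
    for api_event in api_events:
        url = api_event.get("EventInSiteURL")
        web_event = None
        # Only visit the web page when the API supplied a URL and the
        # caller says the API data alone is not enough.
        if url and needs_web_detail(api_event):
            web_event = fetch_detail(url)
        yield api_event, web_event


# Stand-in data and fetcher so the sketch runs without network access.
# The URL and all keys other than EventInSiteURL are made up.
sample_api_events = [
    {"EventId": 1, "EventInSiteURL": "https://example.invalid/MeetingDetail?ID=1"},
    {"EventId": 2, "EventInSiteURL": None},
]
visited = []


def fake_fetch_detail(url: str) -> Event:
    # A real scraper would request and parse the detail page here.
    visited.append(url)
    return {"Audio": None, "url": url}


pairs = list(events_with_web_detail(sample_api_events, fake_fetch_detail))
```

Because the web event is looked up from the URL carried by the API event itself, there is no separate matching step between two independently scraped lists, and events that don't need web-only fields never trigger a page request.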

hancush commented 4 years ago

Done in #93.