openwichita / public-meetings

A service to show all upcoming available public meetings for the City of Wichita.

To scrape or not to scrape #4

Closed Mearnest closed 7 years ago

Mearnest commented 7 years ago

The Board of Education PDF schedule uses color in a visual calendar to denote meeting dates, which makes scraping it very challenging. You have to extract all the relevant styling in addition to the text and its location in the document. I was not able to do this despite trying several PDF scraping libraries and two converters (to HTML or XML), one of them written in C.

There is also a web calendar, but from just inspecting it, it also looks quite challenging to scrape.

Which brings up the issue of whether scraping online schedules is worth the effort. Even when adapting someone else's scraping code, you have to adjust it for every unique PDF and web page, and each one presents different challenges.

For something like a meetings schedule, which comes out once a year, perhaps it would be easier to just use data entry, and then perhaps ask the city to publish a friendlier data format in the future.
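To make the data entry idea concrete, here is a minimal sketch of what a hand-maintained schedule could look like. Everything in it is an assumption for illustration: the CSV column names (`body`, `title`, `date`, `start`, `location`), the `Meeting` class, the `load_meetings` helper, and the sample rows, which are made-up placeholders and not real meetings.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, time


@dataclass(frozen=True)
class Meeting:
    """One hand-entered meeting (hypothetical schema)."""
    body: str       # e.g. the board or council holding the meeting
    title: str
    day: date
    start: time
    location: str


def load_meetings(csv_text: str) -> list[Meeting]:
    """Parse hand-entered CSV rows into Meeting objects, sorted by date and time."""
    reader = csv.DictReader(io.StringIO(csv_text))
    meetings = []
    for row in reader:
        meetings.append(
            Meeting(
                body=row["body"].strip(),
                title=row["title"].strip(),
                # ISO 8601 dates and times keep manual entry unambiguous.
                day=date.fromisoformat(row["date"].strip()),
                start=time.fromisoformat(row["start"].strip()),
                location=row["location"].strip(),
            )
        )
    return sorted(meetings, key=lambda m: (m.day, m.start))


# Placeholder rows only; these are not real meetings.
SAMPLE = """body,title,date,start,location
Example Board,Regular meeting,2099-02-10,18:00,Example Hall
Example Board,Regular meeting,2099-01-13,18:00,Example Hall
"""

if __name__ == "__main__":
    for meeting in load_meetings(SAMPLE):
        print(meeting.day.isoformat(), meeting.start.isoformat("minutes"), meeting.title)
```

A flat file like this could be filled in once a year, and it is also roughly the kind of format we might ask a publisher to provide directly.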

infernocloud commented 7 years ago

I think manual entry would be just fine at first. Maybe one of our in-person hack nights (or even a virtual hack night) can be devoted to populating events for the year (or whatever period is available).

I definitely think talking to the relevant groups about publishing data in a better format is a good goal. Maybe we can find a system that they can drop into their existing workflow, or maybe whatever they already use for scheduling internally has a way to export the data in a better format. Does anyone know contacts who could answer these questions for the City, County, BoE, etc.?

sethetter commented 7 years ago

I think scraping where we can is great, but where it's going to be very difficult, it would be worthwhile to reach out to the person who publishes them online and see if they would be willing to make some minor adjustments to make scraping or importing easier.

Manual entry should always be available as a fallback, but we should automate as much as we can.
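One way to read "automate where we can, fall back to manual entry" is the rough sketch below. The function and type names (`collect_meetings`, `Source`, `broken_scraper`, `manual_entries`) are hypothetical and not part of this repo, and the meeting data is a made-up placeholder.

```python
from typing import Callable, Optional

# A source is any zero-argument callable returning a list of meeting dicts.
Source = Callable[[], list[dict]]


def collect_meetings(scraper: Optional[Source], manual: Source) -> tuple[list[dict], str]:
    """Try the scraper first; fall back to manual entries if it fails or finds nothing."""
    if scraper is not None:
        try:
            meetings = scraper()
            if meetings:
                return meetings, "scraped"
        except Exception as exc:
            # A changed page layout or PDF should not take the whole service down.
            print(f"scraper failed, using manual entries: {exc}")
    return manual(), "manual"


def broken_scraper() -> list[dict]:
    """Stand-in for a scraper whose target document changed (hypothetical)."""
    raise RuntimeError("unexpected document layout")


def manual_entries() -> list[dict]:
    """Stand-in for hand-entered data (placeholder values, not a real meeting)."""
    return [{"title": "Regular meeting", "date": "2099-01-13"}]


if __name__ == "__main__":
    meetings, origin = collect_meetings(broken_scraper, manual_entries)
    print(origin, meetings)
```

Returning the origin alongside the data would let the site flag which schedules were entered by hand, which may help decide where asking a publisher for a friendlier format would pay off most.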