typpo / ca-property-tax

CA property tax visualization
https://www.officialdata.org/ca-property-tax/
GNU Affero General Public License v3.0

Create reusable Scraping and Parsing classes #18

Open jamesshannon opened 3 years ago

jamesshannon commented 3 years ago

There is a lot of repeated code across the counties' scraper and parser scripts, and a lot of code that every script should have but doesn't (like retrying on transient network failures). Additionally, a handful of counties use shared systems (e.g., https://www.mptsweb.net/).

Ideally, a county script would instantiate a class with a few variables (CSV file location, URL template, etc.), define a parse_html() function, and call a method that takes care of everything else.
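
To make the idea concrete, here is a rough sketch of what such a base class could look like. It is only an illustration under assumptions: the names CountyScraper, _download, fetch, parcel_ids, and run, the {parcel_id} placeholder in the URL template, and the id_column parameter are all hypothetical and not taken from the existing scripts. It uses only the standard library and retries with exponential backoff on network errors.

```python
import csv
import time
import urllib.error
import urllib.request


class CountyScraper:
    """Hypothetical base class; a county script subclasses it and defines parse_html()."""

    def __init__(self, csv_path, url_template, id_column,
                 max_retries=3, backoff_seconds=1.0):
        self.csv_path = csv_path          # CSV listing the parcels to look up
        self.url_template = url_template  # e.g. "https://example.invalid/{parcel_id}"
        self.id_column = id_column        # CSV column holding the parcel identifier
        self.max_retries = max_retries
        self.backoff_seconds = backoff_seconds

    def _download(self, url):
        # Single HTTP GET; kept separate so it is easy to replace in tests.
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def fetch(self, url):
        """Download a page, retrying on network errors with exponential backoff."""
        for attempt in range(self.max_retries):
            try:
                return self._download(url)
            except (urllib.error.URLError, TimeoutError):
                # Note: HTTPError (e.g. a 404) is a URLError too, so it is also
                # retried here; a real version may want to re-raise 4xx at once.
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(self.backoff_seconds * 2 ** attempt)

    def parse_html(self, html):
        """County-specific parsing; must be provided by the subclass."""
        raise NotImplementedError

    def parcel_ids(self):
        # Yield one parcel identifier per CSV row.
        with open(self.csv_path, newline="") as f:
            for row in csv.DictReader(f):
                yield row[self.id_column]

    def run(self):
        """Fetch and parse every parcel, yielding (parcel_id, parsed_result)."""
        for parcel_id in self.parcel_ids():
            html = self.fetch(self.url_template.format(parcel_id=parcel_id))
            yield parcel_id, self.parse_html(html)
```

A county script would then be a small subclass that overrides parse_html() and iterates over run(). Counties on a shared system could share one subclass and differ only in the constructor arguments.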

I'm working on this as part of Placer (#17). I'm creating this issue to track and discuss the work.

@typpo One question I have so far is related to my Placer work. You recommend the geojson script step. It seems like it'd be easier to do this in Python (with, e.g., pyshp) to minimize the number of steps that someone has to follow. Have you found that the geojson script is better for one reason or another?

typpo commented 3 years ago

Reusable classes would be very useful! Thanks for getting this started.

Using pyshp would be nicer and cleaner than the geojson conversion. I'm in the habit of converting to geojson first just so I can see what type of data is in the shapefile (for example: Is all the required address info there? Is there zoning info? Does it use latlng or XY coordinates?).
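
The same kind of quick inspection can probably be done straight from pyshp, without the geojson step. The sketch below is only an illustration: summarize_shapefile and summarize_path are hypothetical helper names, and the latlng check is a crude bounding-box heuristic (an assumption, not something pyshp reports). It relies on the Reader's fields and bbox attributes.

```python
def summarize_shapefile(reader):
    """Summarize attribute fields and coordinate range of an open pyshp Reader."""
    # Each field entry is (name, type, size, decimal); skip pyshp's
    # internal DeletionFlag entry if it is present.
    field_names = [f[0] for f in reader.fields if f[0] != "DeletionFlag"]

    # bbox is [xmin, ymin, xmax, ymax] for the whole file.
    xmin, ymin, xmax, ymax = reader.bbox

    # Heuristic only: values inside lon/lat ranges suggest latlng coordinates;
    # much larger values suggest a projected XY system such as State Plane.
    looks_like_lonlat = (-180 <= xmin <= xmax <= 180) and (-90 <= ymin <= ymax <= 90)

    return {
        "fields": field_names,
        "bbox": [xmin, ymin, xmax, ymax],
        "looks_like_lonlat": looks_like_lonlat,
    }


def summarize_path(path):
    """Open a shapefile with pyshp (pip install pyshp) and summarize it."""
    import shapefile  # pyshp's module name

    with shapefile.Reader(path) as reader:
        return summarize_shapefile(reader)
```

Printing the returned "fields" list shows whether address or zoning columns exist, and "looks_like_lonlat" gives a first guess at whether reprojection is needed.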