openstates / openstates-scrapers

source for Open States scrapers
https://openstates.org
GNU General Public License v3.0

[iOS] Standardize Address/Phone/etc keys for Legislators #100

Closed grgcombs closed 12 years ago

grgcombs commented 12 years ago

Trying to eke a little more usable info into the app, and I'm running into an issue that is certainly a direct result of scraping from various state sites. Nevertheless, we need more uniformity for contact info in the legislator API.

Here's a snapshot of some inconsistent keys that prove troublesome:

[
    "+capital_address": " P.O. Box 12068, Capitol Station\nAustin, TX 78711\n(512) 463-0110",
    "+district_address": " 2421 W. 7th, Suite 350, Building A\nFort Worth, TX 76107\n(817) 332-3338",
    "+address": "3271 E 1875 N LAYTON, UT 84040",
    "+office_loc": "410 Farnum Bldg",
    "email":"jsadams@utahsenate.org",
    "email": "vclark123@charter.netvalerie.clark@house.ga.gov", // included this one because emails are doubled
    "+email_address": "Jeremy.Taylor@legis.state.ia.us",
    "office_phone": " 404.656.0265",
    "+phone": "503-986-1702",
    "+district_phone": "(718) 822-2049",
    "+capitol_phone": "518-455-3595",
    "+phone_number": "Home (208) 684-5209",
    "+business_phone": "Bus (208) 522-8100",
    "+fax_number": "FAX (208) 522-1334",
    "+office_fax": "517-373-9320",
    "+website": "http://www.leg.state.or.us/atkinson"
]

Understandably, changing these keys could cause problems for folks who expect the old names. However, the current situation precludes us from including these values in the app, at least in any meaningful way. If we can post-process the scraped data to clean up the dictionary keys, I would propose something like the following:

[
    "email": "valerie.clark@house.ga.gov",
    "+email_other": "vclark123@charter.net",
    "website": "http://www1.legis.ga.gov/legis/2011_12/house/bios/clarkValerie.html",
    "+website_other": "http://voteforvalerieclark.com",
    "+website_other2": "http://www.some-other-linked-page.com",
    "capitol_address": "507 Coverdell Legislative Office Building\n Atlanta, Georgia 30334",
    "capitol_phone": "(404) 656-0202",
    "district_address": "252 Regal Drive\n Lawrenceville, Georgia 30046",
    "+district_address_other": "12312 E. Some Other Street\nSome Town, GA 32322",
    "+district_address_other2": "452 N. Some Avenue\nSome City, GA 30022",
    "district_phone": "(770) 314-0456",
    "+district_phone_other": "(404) 522-8100"
]

... And so on. Basically, we fill the bin of standard keys (capitol_phone, website, email, etc.) and then leftovers go into appendages like "+KEY_other", "+KEY_other2", etc. If we're already mucking about in there, would it be possible to throw some cleanup on the values too? For instance, regex the phone numbers to grab the 10 digits and rewrite them into a consistent format, and potentially clean up street addresses by stripping whitespace, dealing with phone numbers embedded in the address values, fixing capitalization, etc. Those are all positives, but the most important thing to me is standardizing the dictionary keys.
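To make the proposal above concrete, here is a rough Python sketch of the "fill the standard bin, then overflow" idea. It is only an illustration: `KEY_ALIASES` and `standardize_keys` are hypothetical names, the alias table is guessed from the snapshot earlier in this issue, and nothing here reflects how the scrapers actually work.

```python
# Hypothetical alias table: scraped key (leading "+" removed) -> standard key.
# The entries are assumptions based on the snapshot in this issue.
KEY_ALIASES = {
    "capital_address": "capitol_address",
    "capitol_address": "capitol_address",
    "district_address": "district_address",
    "email": "email",
    "email_address": "email",
    "capitol_phone": "capitol_phone",
    "district_phone": "district_phone",
    "website": "website",
}


def standardize_keys(pairs):
    """Map (key, value) pairs onto standard keys.

    The first value for a standard key fills that key; later values go to
    "+KEY_other", "+KEY_other2", and so on. Unknown keys pass through as-is.
    """
    result = {}
    counts = {}
    for raw_key, value in pairs:
        standard = KEY_ALIASES.get(raw_key.lstrip("+"))
        if standard is None:
            result[raw_key] = value  # not in the alias table: leave untouched
            continue
        seen = counts.get(standard, 0)
        counts[standard] = seen + 1
        if seen == 0:
            result[standard] = value
        elif seen == 1:
            result["+%s_other" % standard] = value
        else:
            result["+%s_other%d" % (standard, seen)] = value
    return result
```

The function takes a list of pairs rather than a dict because the scraped data can contain the same key more than once. Which value deserves the standard slot (for example, the official email versus a campaign one) is a separate decision that this sketch does not attempt to make; it simply keeps the first one it sees.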

jamesturk commented 12 years ago

re: standardizing key names, absolutely we can and should start doing this - I think I have someone well suited to the task that I can put on it in the next week or two

there's an item on my TODO list about detailed item-specific validation/cleanup hooks, I think that phone numbers, etc. would be perfect for this so perhaps it is time to focus on getting that working as well
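As a sketch of what such a phone-number cleanup hook might look like (the function name `normalize_phone` and the "(AAA) PPP-NNNN" output format are assumptions, not anything that exists in the codebase), something along these lines would handle the examples in the snapshot above:

```python
import re


def normalize_phone(raw):
    """Return the number as "(AAA) PPP-NNNN", or None if it lacks 10 digits."""
    digits = re.sub(r"\D", "", raw)  # drop labels, spaces, and punctuation
    # Tolerate a leading US country code, e.g. "1-512-463-0110".
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None  # leave anything ambiguous for a human to look at
    return "(%s) %s-%s" % (digits[:3], digits[3:6], digits[6:])
```

Values such as " 404.656.0265" or "Home (208) 684-5209" come out in one format, while a value like "410 Farnum Bldg" returns None rather than a guessed number. Extensions and values containing two phone numbers are not handled here.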

grgcombs commented 12 years ago

Maybe in V2, we could construct and populate some standard office/contact dictionaries ...

{
  "id" : "NYL000002",
  "leg_id" : "NYL000002",
  "state" : "ny",
  "chamber" : "upper",
  "party" : "Democratic",
  "district" : "15",
  "full_name" : "Joseph P Addabbo Jr.",
  "photo_url" : "http://www.nysenate.gov/files/imagecache/senator_teaser/profile-pictures/Addabbo.SD15.jpg",
  "updated_at" : "2011-09-06 06:24:39",
  "created_at" : "2011-05-05 01:29:13",
  "emails" : [
    "addobbo@nysenate.gov",
    "info@voteaddobbo.com"
  ],
  "websites" : [
      "http://www.nysenate.gov/Addobbo",
      "http://www.voteaddobbo.com/"
  ],
  "offices" : [
    {
      "office_id" : "NYO000002",
      "leg_id" : "NYL000002",
      "type" : "capitol",
      "phone" : "(404) 656-0202",
      "fax" : "(404) 656-0203",
      "address" : "111 Broadway\n New York, New York 10046",
      "coordinates" : [-32.123123,101.2212]  // tee hee!
    },
    {
      "office_id" : "NYO000003",
      "leg_id" : "NYL000002",
      "type" : "district",
      "phone" : "(404) 555-1212",
      "fax" : null,
      "address" : "88 Main St\n Syracuse, New York 10011",
      "coordinates" : [-7.320, 72.444]  // tee hee!
    }
  ]
}

grgcombs commented 12 years ago

On a related note, I've thrown together some preliminary thoughts on a V2 API via the Wiki

jamesturk commented 12 years ago

a new offices key has been added; offices can now be added in the scrape, and they look like this:

{
   'type': 'capitol',  // will be capitol|district
   'name': 'Capitol Office',
   'fax': null,
   'phone': '202-555-0001',
   'address': '212 Maple Lane\nRaleigh, NC 27526',
   'email': null
}

need to add this in more states (LA was the experimental one for this)
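
For states that already expose flat keys like the ones proposed earlier in this thread, a conversion into the office shape above could look roughly like the following. This is a sketch under assumptions: `build_offices` is a hypothetical helper, the `*_fax` keys and the "District Office" name are guesses, and the real scrapers may populate offices quite differently.

```python
def build_offices(legislator):
    """Collect flat capitol_*/district_* keys into a list of office dicts."""
    offices = []
    for office_type, name in (("capitol", "Capitol Office"),
                              ("district", "District Office")):
        address = legislator.get("%s_address" % office_type)
        phone = legislator.get("%s_phone" % office_type)
        fax = legislator.get("%s_fax" % office_type)  # assumed key name
        if not (address or phone or fax):
            continue  # nothing known about this office, so skip it
        offices.append({
            "type": office_type,
            "name": name,
            "fax": fax,
            "phone": phone,
            "address": address,
            "email": None,  # per-office email is not derived in this sketch
        })
    return offices
```

Missing fields come through as None, which serializes to null in JSON, matching the example above.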