propublica / sunlight-congress

The Sunlight Foundation's Congress API. Shut down on Oct. 1, 2017.
https://www.propublica.org/nerds/item/congress-api-bill-subjects-personal-explanations-and-sunsetting-sunlight

Pluck out legislator_names and bioguide_ids from clip description #27

konklone closed 13 years ago

konklone commented 14 years ago

Add a legislator_names array of raw extracted names ("Mr. Price (GA)", "Mr. Stevens", etc.) to each clip, and an aggregated array on the top-level object containing every name mentioned in its clips.
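A minimal sketch of that extraction step, in Python for illustration only (this thread does not show the project's own code). The regular expression, the `description` key on each clip, and the function names are assumptions; real clip descriptions may use titles or formats this pattern misses.

```python
import re

# Assumed pattern: a courtesy title, a capitalized surname, and an optional
# two-letter state in parentheses, e.g. "Mr. Price (GA)" or "Mr. Stevens".
NAME_PATTERN = re.compile(
    r"\b(?:Mr|Mrs|Ms)\.\s+[A-Z][A-Za-z'-]+(?:\s+\([A-Z]{2}\))?"
)


def extract_legislator_names(description):
    """Return the raw names found in one description, in order, without repeats."""
    names = []
    for name in NAME_PATTERN.findall(description):
        if name not in names:
            names.append(name)
    return names


def add_legislator_names(record):
    """Set legislator_names on each clip and an aggregated list on the record."""
    all_names = []
    for clip in record.get("clips", []):
        # "description" is a hypothetical key for the clip's description text.
        clip["legislator_names"] = extract_legislator_names(
            clip.get("description", "")
        )
        for name in clip["legislator_names"]:
            if name not in all_names:
                all_names.append(name)
    record["legislator_names"] = all_names
    return record
```

Keeping the raw strings, state suffix included, leaves the matching step free to use or ignore the state hint later.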

Add a bioguide_ids array of matched bioguide IDs ("L000551", etc.) to each clip, determined from the extracted names. Err on the side of including too many bioguide IDs: if the clip mentions "Mr. Smith" and that matches 3 people, add all 3 of their bioguide IDs to the array, to be safe. As you said, false positives are better than not matching at all. Also add an array to the top-level object that holds the unique bioguide_ids across all clips.
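One way the inclusive matching could look, again as a Python sketch rather than the project's actual code. The `last_name`, `state`, and `bioguide_id` keys on each legislator are assumed field names, and using a state suffix to narrow the match (with a fallback to every surname match) is a design choice of this sketch, not something the issue specifies.

```python
import re

# Splits a raw name such as "Mr. Price (GA)" into a surname and optional state.
RAW_NAME = re.compile(
    r"^(?:Mr|Mrs|Ms)\.\s+(?P<last>[A-Za-z'-]+)(?:\s+\((?P<state>[A-Z]{2})\))?$"
)


def match_bioguide_ids(raw_name, legislators):
    """Return every bioguide ID that could plausibly belong to raw_name."""
    parsed = RAW_NAME.match(raw_name)
    if parsed is None:
        return []
    last = parsed.group("last").lower()
    matches = [leg for leg in legislators if leg["last_name"].lower() == last]
    state = parsed.group("state")
    if state:
        in_state = [leg for leg in matches if leg["state"] == state]
        # If the state rules everyone out, keep all surname matches:
        # a false positive is preferable to no match at all.
        if in_state:
            matches = in_state
    return [leg["bioguide_id"] for leg in matches]


def add_bioguide_ids(record, legislators):
    """Set bioguide_ids per clip, plus the unique IDs on the top-level object."""
    unique_ids = []
    for clip in record.get("clips", []):
        clip_ids = []
        for name in clip.get("legislator_names", []):
            for bioguide_id in match_bioguide_ids(name, legislators):
                if bioguide_id not in clip_ids:
                    clip_ids.append(bioguide_id)
        clip["bioguide_ids"] = clip_ids
        for bioguide_id in clip_ids:
            if bioguide_id not in unique_ids:
                unique_ids.append(bioguide_id)
    record["bioguide_ids"] = unique_ids
    return record
```

With three legislators named Smith in the list, `match_bioguide_ids("Mr. Smith", legislators)` returns all three IDs, which is the "too many rather than too few" behavior described above.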

I'll make sure there's an index on all 4 array fields - "bioguide_ids", "legislator_names", "clips.bioguide_ids", and "clips.legislator_names". Mongo takes care of indexing arrays and fields inside of arrays.
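For reference, the four indexes could be declared roughly as follows. This is a sketch that assumes a pymongo `Collection` object is passed in; the helper names are hypothetical, and the thread does not say which driver or collection the project actually uses.

```python
# The four array fields named above; dotted paths reach into each clip.
INDEXED_FIELDS = [
    "bioguide_ids",
    "legislator_names",
    "clips.bioguide_ids",
    "clips.legislator_names",
]


def ensure_indexes(collection):
    """Create one single-field index per array field.

    `collection` is assumed to be a pymongo Collection. MongoDB makes an
    index multikey automatically when the indexed field holds an array.
    """
    return [collection.create_index(field) for field in INDEXED_FIELDS]


def records_mentioning(collection, bioguide_id):
    """Find records where any clip lists the given bioguide ID."""
    return collection.find({"clips.bioguide_ids": bioguide_id})
```

A query on `clips.bioguide_ids` matches a record if any one of its clips contains the ID, so callers do not need to iterate over clips themselves.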

You can scope matching for particular names by chamber, so you only need to look for "Mr. Price" among legislators whose chamber field is "house".

But bear in mind that we can't just match on legislators whose in_office field is true, as legislators may go in and out of office mid-session, and as we transition to the 112th session our database will have multiple sessions.
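The two scoping rules above, sketched in Python: narrow the candidate pool by chamber, but leave in_office out of the filter. The `chamber` and `in_office` fields come from this thread; the function name and the sample records (including their IDs) are made up for illustration.

```python
def candidates_in_chamber(legislators, chamber):
    """Narrow the pool by chamber only.

    in_office is deliberately ignored: a legislator may leave or join
    mid-session, and the database may span more than one session.
    """
    return [leg for leg in legislators if leg["chamber"] == chamber]


# Made-up sample records; the bioguide IDs here are placeholders.
SAMPLE_LEGISLATORS = [
    {"bioguide_id": "X000004", "last_name": "Price", "chamber": "house", "in_office": True},
    {"bioguide_id": "X000006", "last_name": "Price", "chamber": "house", "in_office": False},
    {"bioguide_id": "X000007", "last_name": "Price", "chamber": "senate", "in_office": True},
]

# For a House clip, both House members named Price stay in the pool,
# including the one no longer in office.
house_pool = candidates_in_chamber(SAMPLE_LEGISLATORS, "house")
```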

(It's my hope that eventually our Congress API will evolve to maintain a range of when people were in office, which would help us make more precise choices in our other projects, too.)

kaitlin commented 13 years ago

I think this is all set.