Add a legislator_names array to each clip, holding the raw extracted names ("Mr. Price (GA)", "Mr. Stevens", etc.), plus an aggregated legislator_names array on the top-level object that holds every name mentioned in any of its clips.
Add a bioguide_ids array to each clip, holding the bioguide IDs ("L000551", etc.) matched from the extracted names. Err on the side of including too many bioguide IDs: if the clip mentions "Mr. Smith" and that matches 3 people, add all 3 of their bioguide IDs to the array, to be safe. As you said, false positives are better than not matching at all. Add a bioguide_ids array to the top-level object as well, holding the unique bioguide IDs across all clips.
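To make the shape concrete, here's a rough sketch of how the top-level arrays could be built from the per-clip ones. The helper names (unique_in_order, aggregate_clip_fields) and the "X00000N" IDs are made up for illustration, and dropping duplicate names from the top-level legislator_names array is my assumption; only the field names come from the description above.

```python
def unique_in_order(values):
    """Return values with duplicates dropped, keeping first-seen order."""
    seen = set()
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


def aggregate_clip_fields(video):
    """Fill the top-level arrays from the per-clip arrays, in place."""
    clips = video.get("clips", [])
    video["legislator_names"] = unique_in_order(
        name for clip in clips for name in clip.get("legislator_names", [])
    )
    video["bioguide_ids"] = unique_in_order(
        bioguide_id for clip in clips for bioguide_id in clip.get("bioguide_ids", [])
    )
    return video


# Placeholder data: the bioguide IDs below are not real.
video = {
    "clips": [
        {
            "legislator_names": ["Mr. Price (GA)", "Mr. Stevens"],
            "bioguide_ids": ["X000001", "X000002"],
        },
        {
            "legislator_names": ["Mr. Stevens"],
            "bioguide_ids": ["X000002", "X000003"],
        },
    ]
}
aggregate_clip_fields(video)
```

After this runs, the top-level object carries one copy of each name and each ID, while every clip keeps its own arrays untouched.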
I'll make sure there's an index on all 4 array fields - "bioguide_ids", "legislator_names", "clips.bioguide_ids", and "clips.legislator_names". Mongo takes care of indexing arrays and fields inside of arrays.
You can scope matching for particular names by chamber, so you only need to look for "Mr. Price" among legislators whose chamber field is "house".
But bear in mind that we can't just match on legislators whose in_office field is true, as legislators may go in and out of office mid-session, and as we transition to the 112th session our database will have multiple sessions.
(It's my hope that eventually our Congress API will evolve to maintain a range of when people were in office, which would help us make more precise choices in our other projects, too.)
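The matching rules above could look something like the sketch below. It is only an illustration under assumptions: the legislator record fields last_name, state, and bioguide_id, the list of titles in the pattern, the function name match_bioguide_ids, and the sample records with their "X00000N" IDs are all hypothetical. Only chamber and in_office are fields named in this thread.

```python
import re

# Assumed name format: a title, a last name, and an optional "(ST)" state suffix.
# The title list is illustrative, not taken from the real transcripts.
NAME_PATTERN = re.compile(
    r"^(?:Mr|Mrs|Ms|Miss)\.?\s+(?P<last>.+?)(?:\s+\((?P<state>[A-Z]{2})\))?$"
)


def match_bioguide_ids(raw_name, legislators, chamber):
    """Return every bioguide ID that could plausibly match raw_name.

    Matching is scoped by chamber and, when present, by state. The
    in_office field is deliberately ignored, so legislators who left
    office mid-session still match.
    """
    parsed = NAME_PATTERN.match(raw_name.strip())
    if parsed is None:
        return []
    last_name = parsed.group("last").lower()
    state = parsed.group("state")

    matches = []
    for legislator in legislators:
        if legislator["chamber"] != chamber:
            continue
        if legislator["last_name"].lower() != last_name:
            continue
        if state is not None and legislator["state"] != state:
            continue
        # Keep all candidates: false positives are preferred over misses.
        if legislator["bioguide_id"] not in matches:
            matches.append(legislator["bioguide_id"])
    return matches


# Placeholder records: the IDs and in_office values are invented.
legislators = [
    {"last_name": "Price", "state": "GA", "chamber": "house",
     "in_office": True, "bioguide_id": "X000001"},
    {"last_name": "Price", "state": "NC", "chamber": "house",
     "in_office": False, "bioguide_id": "X000002"},
    {"last_name": "Price", "state": "TX", "chamber": "senate",
     "in_office": True, "bioguide_id": "X000003"},
]

ambiguous = match_bioguide_ids("Mr. Price", legislators, "house")
scoped = match_bioguide_ids("Mr. Price (GA)", legislators, "house")
```

Here "Mr. Price" alone keeps both House candidates, including the one whose in_office is false, while the "(GA)" suffix narrows it to one. The senate record never matches because of the chamber scope.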