propublica / Capitol-Words

Scraping, parsing and indexing the daily Congressional Record to support phrase search over time, and by legislator and date
BSD 3-Clause "New" or "Revised" License
122 stars 34 forks

broken images for legislator photos #10

Closed sbma44 closed 13 years ago

sbma44 commented 13 years ago

Missing photo for Blanche Lincoln at http://capitolwords.org/legislator?chamber=&party=D&state=AR&congress=109

More missing photos at http://capitolwords.org/legislator?chamber=&party=R&state=MN&congress=105

Missing photo for Christensen at http://capitolwords.org/legislator?chamber=House&party=&state=&congress=112

sbma44 commented 13 years ago

I have asked Eric for his take on why there might be two Blanche Lincolns at that first link. Given the duplicates, I'm guessing that this is either an API error or that we just need to filter the records that are coming back in some way.

konklone commented 13 years ago

All congressperson pictures are identified by bioguide ID. On the first link, the second "Blanche Lincoln" has a bioguide ID of L000555, which, as it turns out, is a valid bioguide ID for her, but only as a redirect to an alternate spelling of her name: http://bioguide.congress.gov/scripts/biodisplay.pl?index=L000555

On the second page, Rod Grams has been out of office since 2001, and our Legislator API doesn't go back that far (it goes back to, I believe, the 110th Congress that started in Jan of 2007): http://bioguide.congress.gov/scripts/biodisplay.pl?index=G000367

Same for Gilbert Gutknecht, who also predates what our Legislator API covers: http://bioguide.congress.gov/scripts/biodisplay.pl?index=G000536

On the third page, Christensen is the victim of another separate-ID-for-alternate-spelling: http://bioguide.congress.gov/scripts/biodisplay.pl?index=C001039

How do we end up with these alternate bioguide IDs? They don't appear in our Sunlight API, so I guess Capitol Words does name->bioguide resolution using an alternate source?

For the issue with older Congresspersons, this is going to be widespread if we're going back far enough. Here are two pages missing photos for Grams, and also for Tom Daschle: http://capitolwords.org/date/2000/12/14/S11774-2-serving-in-the-senate and http://capitolwords.org/date/1998/09/17/S10501-2-sense-of-the-senate-regarding-puerto-rico

I think our choices there are either to do a very comprehensive retroactive update of our photo database, or find a cute "this user hasn't uploaded a profile picture!" image to use where we don't have one.

sbma44 commented 13 years ago

Thanks for looking at this so quickly! Yeah, Aaron filled me in shortly after I emailed you and told me that the project's wide date range forced him to use a custom solution. Sounds like the way to go is either JavaScript suppression or getting Tim to generate a list of 404s (though I doubt logging was turned on for that bucket) for generic image placement. Unfortunately the S3 media hosting eliminates the possibility of a more elegant nginx fix.

Perhaps I'm missing something and there's a way to detect a missing photo in advance during page rendering, but I don't see how. I guess we could look the bioguide up against a list of all the photos we have -- maybe that's not too inefficient. Hacky, though.

konklone commented 13 years ago

I think your assessment is basically right, though generating a JavaScript array of known valid bioguide IDs from the legislator API is not hard to do and keep up to date if CW downloads the CSV of legislator info every night. If the bioguide ID from the CR isn't in the array, display the no-photo pic.
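The membership check described above can be sketched in a few lines. This is only an illustration, written in Python rather than as the JavaScript array proposed in the comment: the CSV column name `bioguide_id`, the two URL constants, and both function names are assumptions made for the example, not anything taken from the Capitol Words code or the Sunlight API.

```python
import csv
import io

# Hypothetical paths; the real photo bucket layout is not given in this thread.
PLACEHOLDER_URL = "/media/img/no-photo.png"
PHOTO_URL_TEMPLATE = "/media/photos/{bioguide_id}.jpg"


def load_known_bioguide_ids(csv_text, column="bioguide_id"):
    """Return the set of non-empty bioguide IDs found in a legislator CSV.

    The column name is an assumption; adjust it to match the nightly download.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        (row.get(column) or "").strip()
        for row in reader
        if (row.get(column) or "").strip()
    }


def photo_url(bioguide_id, known_ids):
    """Pick the legislator photo if we expect to have one, else the placeholder."""
    if bioguide_id in known_ids:
        return PHOTO_URL_TEMPLATE.format(bioguide_id=bioguide_id)
    return PLACEHOLDER_URL
```

Using a set keeps each lookup constant-time, so checking every legislator on a page against the list should not be a meaningful cost. The same set could be serialized to a JavaScript array if the check needs to happen in the browser instead.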

As to the alternate solution of filling out our photo database - James has some code to automatically fetch and correctly size legislator photos, but I don't know how easily it could be adapted to serve older Congresses. It'd be neat if we could just take the list of distinct bioguide IDs that appear in the CapitolWords database and use it to generate a much more complete set of photos. It'd be fine to have photos of legislators that don't appear in the API.
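Working out which photos would need fetching is a set difference between the IDs in the database and the IDs we already have photos for. A minimal sketch, assuming both lists are already available as plain Python iterables (how they are obtained, and the function name, are hypothetical):

```python
def missing_photo_ids(ids_in_database, ids_with_photos):
    """Return the sorted, de-duplicated bioguide IDs that have no photo yet."""
    return sorted(set(ids_in_database) - set(ids_with_photos))
```

The result could then be fed to whatever photo-fetching script is adapted for older Congresses.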

drinks commented 13 years ago

Fixed in d0359d; current behavior is to hide broken images. A slight tweak could replace them with our 'no photo available' placeholder.