You can generally get this for current or recent members if you know the bioguide ID. If you look on the bioguide page for Mo Cowan, for instance, you'll see his photo at this URL:
http://bioguide.congress.gov/bioguide/photo/C/C001099.jpg
In whatever language you use, you can then construct the URL from something like:
"http://bioguide.congress.gov/bioguide/photo/" + bioguide[0] + "/" + bioguide + ".jpg"
There isn't 100% coverage for current members, but it's a good start.
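In Python, for instance, that construction might look like this (the helper name is just for illustration):

```python
def bioguide_photo_url(bioguide_id):
    # e.g. "C001099" -> "http://bioguide.congress.gov/bioguide/photo/C/C001099.jpg"
    return "http://bioguide.congress.gov/bioguide/photo/%s/%s.jpg" % (
        bioguide_id[0], bioguide_id)

print(bioguide_photo_url("C001099"))
```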
Sunlight also offers a set of MoC photos, named by Bioguide ID, for download as a zip file. We normalize them into a bunch of different sizes, with the largest being 200x250.
Even though this project doesn't actually host the MoC photos, the little shell script Sunlight uses to do the resize work is in this repo, and you could adapt it to your needs. Either way, you'd want to put the result into S3 or something.
Also, we get those photos from the Congressional Pictorial Directory, published by GPO, so they may not be the same as the ones on bioguide.congress.gov.
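Sunlight's resizer is a shell script, but the same normalization idea in Python with Pillow would be roughly the sketch below (not their actual script; the sizes are the ones mentioned in this thread):

```python
import os
from PIL import Image  # Pillow; using it here is an assumption

# Sketch: normalize one original photo into the sizes mentioned in this thread.
SIZES = [(200, 250), (100, 125), (40, 50)]

def make_sizes(src_path, bioguide_id, out_root="images"):
    original = Image.open(src_path)
    for width, height in SIZES:
        out_dir = os.path.join(out_root, "%dx%d" % (width, height))
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        original.resize((width, height)).save(
            os.path.join(out_dir, "%s.jpg" % bioguide_id))
```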
Also bulk data from GovTrack: https://www.govtrack.us/developers/data
I could probably add a has_photo field to the GovTrack API....
Awesome. Thanks guys!
We had a terrific thread over at https://github.com/sunlightlabs/congress/issues/432#issuecomment-34554026 on this, and (after @mwweinberg picked up the phone and called the GPO), the resolution was to make a scraper for the GPO's Member Guide, and then offer the photos for download.
I'm updating this ticket's description to reflect this. Does anyone have any objection to adding the images to this repository, or should they go somewhere else? I think it's convenient to have them here, since it's in scope for the repo.
For reference, 812 200x250 JPGs of members currently take up ~23MB. The 100x125 versions take up 4.8M, and the 40x50 versions 3.6M. All 3 folders, plus .zip files for each one, take up 56M in total.
The versions on the Member Guide are 231x281, which we could keep as originals (increasing disk use), or just continue making 200x250 versions and ditch the originals.
/cc @JoshData @sbma44
One additional idea: we could potentially store the photos in a gh-pages branch, and produce permalinks to each photo.
There are high-res images over there, I think.
I don't know whether we really need to store them in a repo (vs just a scraper), but if we do I'd strongly prefer a separate repo for it.
I think it might be a neat, low-maintenance thing to version the images. But a separate repo is fine, and makes experimentation easier.
Do you know how to get the high-res images?
They were hi-res in the Wikipedia links, and the DOM inspector seemed to indicate they were bigger than displayed on the site, but I didn't get ANY image when hitting the image URL, so that's as far as I got.
OK, so I did:
wget "http://memberguide.gpo.gov/ReadLibraryItem.ashx?SFN=iqw/hTCdweheEMFH1iwn0bt5yckfRo6E2eA2JdiV4F5SafjBF0U12w==&I=1MKI2SYWd4A="`
and that got me a 589x719
version of Vance McAllister, 86.6K in size.
A quick search to see if anyone had written a scraper for this before turned up nothing, but it did find this PDF of "Grading the Government’s Data Publication Practices" by Jim Harper, which you may find interesting (though it may already be familiar to you).
"As noted above, the other ways of learning about House and Senate membership are ad hoc. The Government Printing Office has a “Guide to House and Senate Members” at http://memberguide.gpo.gov/ that duplicates information found elsewhere. The House website presents a list of members along with district information, party affiliation, and so on, in HTML format (http://www.house.gov/representatives/), and beta.congress.gov does as well (http://beta.congress.gov/members/). Someone who wants a complete dataset must collect data from these sources using a computer program to scrape the data and through manual curation. The HTML presentations do not break out key information in ways useful for computers. The Senate membership page, on the other hand, includes a link to an XML representation that is machine readable. That is the reason why the Senate scores so well compared to the House."
http://www.cato.org/pubs/pas/PA711.pdf
http://beta.congress.gov/members is nicely scrapable (I wonder if they have an API), but then some images are missing, and we are back wondering about the copyright.
The mobile memberguide site is very scrapable, but the images are hosted on m.gpo.gov and are only a lo-res image and a lower-res thumbnail. http://m.gpo.gov/memberguide/
But if wget works on memberguide.gpo.gov then that is a good start. As it happens, Wikipedia uses the same image of Vance McAllister, and their original file is also 589 × 719. https://en.wikipedia.org/wiki/File:Vance_McAllister.jpg
"(I wonder if they have an API)"
Welcome to the world of legislative data. A fantastic and frustrating world awaits. :)
Thanks for doing the research on getting the images, btw.
That is solid research. I think a new scraper for the normal (non-mobile) member guide is what's called for, to get the maximum size and the greatest guarantee of public domain.
The normal member guide defaults to the 112th Congress. It can be downloaded with wget:
wget "http://memberguide.gpo.gov/GetMembersSearch.aspx"
I think we should be able to get the table we want with POST data. For example, to select the 113th Congress with wget:
wget "http://memberguide.gpo.gov/GetMembersSearch.aspx" --post-data 'ctl00$ContentPlaceHolder1$ddlCongressSession=113'
But I've not got it working yet (see the sketch after the field list below for what's probably missing). Anyway, that page has links to each member's own page, with the photo.
Some other options:
ctl00$ContentPlaceHolder1$ddlMemberType:Select Member Type
ctl00$ContentPlaceHolder1$ddlParty:All
ctl00$ContentPlaceHolder1$ddlCongressSession:113
ctl00$ContentPlaceHolder1$ddlSearchIn:Select Category
ctl00$ContentPlaceHolder1$ddlSearchCriteria:Like
ctl00$ContentPlaceHolder1$txtSearch:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl00$GoToPageTextBox:1
ctl00_ContentPlaceHolder1_Memberstelerikrid_ctl00_ctl02_ctl00_GoToPageTextBox_ClientState:{"enabled":true,"emptyMessage":"","validationText":"1","valueAsString":"1","minValue":1,"maxValue":12,"lastSetTextBoxValue":"1"}
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl00$ChangePageSizeTextBox:50
ctl00_ContentPlaceHolder1_Memberstelerikrid_ctl00_ctl02_ctl00_ChangePageSizeTextBox_ClientState:{"enabled":true,"emptyMessage":"","validationText":"50","valueAsString":"50","minValue":1,"maxValue":554,"lastSetTextBoxValue":"50"}
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnLastName:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnStateDescription:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnMemberType:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnPartyDescription:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnHometown:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnDistrict:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnTermCount:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl03$ctl01$GoToPageTextBox:1
ctl00_ContentPlaceHolder1_Memberstelerikrid_ctl00_ctl03_ctl01_GoToPageTextBox_ClientState:{"enabled":true,"emptyMessage":"","validationText":"1","valueAsString":"1","minValue":1,"maxValue":12,"lastSetTextBoxValue":"1"}
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl03$ctl01$ChangePageSizeTextBox:50
ctl00_ContentPlaceHolder1_Memberstelerikrid_ctl00_ctl03_ctl01_ChangePageSizeTextBox_ClientState:{"enabled":true,"emptyMessage":"","validationText":"50","valueAsString":"50","minValue":1,"maxValue":554,"lastSetTextBoxValue":"50"}
ctl00_ContentPlaceHolder1_Memberstelerikrid_rfltMenu_ClientState:
ctl00_ContentPlaceHolder1_Memberstelerikrid_ClientState:
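One likely reason the bare POST fails: ASP.NET pages like this usually expect the hidden __VIEWSTATE and __EVENTVALIDATION fields from a prior GET to be echoed back. A rough, untested sketch of that round-trip with requests and BeautifulSoup (not necessarily what this site actually requires; the __EVENTTARGET line in particular is a guess):

```python
import requests
from bs4 import BeautifulSoup

URL = "http://memberguide.gpo.gov/GetMembersSearch.aspx"

session = requests.Session()
soup = BeautifulSoup(session.get(URL).text, "html.parser")

# Echo back every hidden ASP.NET field (__VIEWSTATE, __EVENTVALIDATION, ...),
# then override the Congress dropdown value.
data = {field["name"]: field.get("value", "")
        for field in soup.find_all("input", type="hidden")
        if field.get("name")}
data["ctl00$ContentPlaceHolder1$ddlCongressSession"] = "113"
# The dropdown may also need an __EVENTTARGET to trigger the postback -- untested.
data["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$ddlCongressSession"

result = session.post(URL, data=data)
# result.text should now (hopefully) contain the 113th Congress member table.
```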
I think a wget of the whole site will not be enough, because we need to be able to associate Bioguide IDs with the downloaded pictures (and name them after them). So, the scraper should run over the member guide, use the congress-legislators data to resolve the names, and name the downloaded image after the Bioguide ID.
The couple of member pages I checked handily have a link to a bio page, which contains the Bioguide ID in the URL, for example:
<a id="ctl00_ContentPlaceHolder1_lnkBioguide" title="Click to View Biographical Informaion" class="iconbio" href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000022/" target="_blank">Biographical Information</a>
Hmm, I see that, though when I checked a recent member's page for Vance McAllister, it didn't have one. So I think it's probably better to resolve using congress-legislators data, using the last name, state, and chamber for disambiguation where needed. When matching against legislators who served a term in a particular Congress only, that should be pretty doable.
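For what it's worth, that resolution step might look something like this in Python (a sketch only, assuming the standard congress-legislators YAML layout with name.last, id.bioguide, and per-term state/type fields):

```python
import yaml

# Build a (last name, state, chamber) -> Bioguide ID lookup from the
# congress-legislators data; field names assume the standard layout.
with open("legislators-current.yaml") as f:
    legislators = yaml.safe_load(f)

lookup = {}
for leg in legislators:
    term = leg["terms"][-1]  # most recent term
    key = (leg["name"]["last"].lower(), term["state"], term["type"])
    lookup[key] = leg["id"]["bioguide"]

# A key looks like ("mcallister", "LA", "rep"); if it's in the lookup,
# the value is that member's Bioguide ID.
```

Ties on last name would still need state and chamber (and, where necessary, the first name) to disambiguate, and the member guide's state/chamber labels would have to be mapped to the YAML's codes.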
It's probably just because McAllister is new. He also has no photo. Don't throw the baby (bioguide IDs) out with the bathwater (McAllister)!
There's some strange monkey business in their code, I think. For a while I wasn't getting reliable images even when using the same URL -- sometimes it produced photos of different legislators than the one I thought I was selecting, sometimes different resolutions, sometimes placeholder images. I suspect they're doing something stupid with session variables. This is pretty easy to verify, given that a bare curl of the image src generally doesn't return the right photo.
FWIW a working curl invocation (taken from chrome's network tab) is below, and there isn't too much to it. My testing makes me think referer is probably irrelevant but I'm not 100% sure. I suspect you are going to have to establish the session cookie, though, and perhaps grab each legislator page's HTML before attempting to grab the image. I could be wrong about this, but something weird seems to be going on.
curl 'http://memberguide.gpo.gov/ReadLibraryItem.ashx?SFN=EwZsCCzU55gKCia/s41Zv1J0gtcl+Ev+&I=1MKI2SYWd4A=' -H 'DNT: 1' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36' -H 'Accept: image/webp,*/*;q=0.8' -H 'Referer: http://memberguide.gpo.gov/112/SR/Alexander' -H 'Cookie: AspxAutoDetectCookieSupport=1; ASP.NET_SessionId=gxscujd2iai1z1nheatk0gy3' -H 'Connection: keep-alive' -H 'Cache-Control: max-age=0' --compressed
Hmm, @JoshData, McAllister had a photo yesterday; his was the example I used. And I strongly suspect it has to do with what @sbma44 is isolating.
But I wonder if it's time-based rather than session-based? Because yesterday, I ran:
wget "http://memberguide.gpo.gov/ReadLibraryItem.ashx?SFN=iqw/hTCdweheEMFH1iwn0bt5yckfRo6E2eA2JdiV4F5SafjBF0U12w==&I=1MKI2SYWd4A="
And it gave me a photo for McAllister. Now, that exact command downloads a photo that says "No photo". I don't see why wget would have had access to my browser's session or anything yesterday.
This is malarkey!
I ran the exact same command yesterday and got the photo. Today, no photo.
They're using cookies and sessions. Must be something in that URL. See also the ASP.NET_SessionId in the curl command.
I think we need to go in through the front door and proceed from there.
Yeah, but a straight wget without any known sessions or cookies also downloads the same file @sbma44 got with curl:
wget "http://memberguide.gpo.gov/ReadLibraryItem.ashx?SFN=EwZsCCzU55gKCia/s41Zv1J0gtcl+Ev+&I=1MKI2SYWd4A="
and now the McAllister image works with wget too. 15 minutes ago, it didn't -- and the Member Guide was showing the same No Photo image. Now McAllister's page is back to normal and shows his photo.
Now I'm wondering if you can reliably wget any photo from these pages, as long as you've "warmed" them somehow, maybe by fetching the page with a browser or curl with the options @sbma44 used, and waiting a minute.
Maybe just a wget or curl of the front page will do it...
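In Python terms, the "warm it, then fetch" idea would be something like the sketch below with requests.Session -- pure speculation about their session handling, not a confirmed fix:

```python
import requests

session = requests.Session()  # keeps the ASP.NET_SessionId cookie between requests

# Hypothetical flow: load the member's page first, then fetch the image URL
# found in its HTML using the same session.
member_page = session.get("http://memberguide.gpo.gov/112/SR/Alexander")
image = session.get("http://memberguide.gpo.gov/ReadLibraryItem.ashx"
                    "?SFN=EwZsCCzU55gKCia/s41Zv1J0gtcl+Ev+&I=1MKI2SYWd4A=")

with open("photo.jpg", "wb") as f:
    f.write(image.content)
```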
By all means, see if you can get that working -- if it works, that'd be the easiest method.
Wow. This is one of the wackiest web scraping situations I've seen.
There's going to be an opportunity to bug GPO about things in a few weeks. One of us can bring it up, or @hugovk if you're in the DC area we can let you know how to come bug GPO in person.
@JoshData Thanks, but I'm in the Helsinki area :)
I've discovered Python's mechanize package, which is able to get the form, change values and resubmit. It can find and open each of the member links. With the help of BeautifulSoup we can find the img tags on each page and download them.
These were useful: https://blog.scraperwiki.com/2011/11/how-to-get-along-with-an-asp-webpage/ https://views.scraperwiki.com/run/python_mechanize_cheat_sheet/
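Roughly, the mechanize half of that looks like the sketch below -- not the actual script, and the form index and dropdown value are assumptions:

```python
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://memberguide.gpo.gov/GetMembersSearch.aspx")

# Assume the ASP.NET form is the first form on the page; pick the 113th
# Congress in the dropdown and resubmit (the option value is a guess).
br.select_form(nr=0)
br["ctl00$ContentPlaceHolder1$ddlCongressSession"] = ["113"]
response = br.submit()

# Then BeautifulSoup can pull out the member links / img tags.
soup = BeautifulSoup(response.read(), "html.parser")
images = soup.find_all("img")
```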
I've made a first proto version here: https://github.com/hugovk/congress-legislators/blob/master/scripts/gpo_member_photos.py
Still to do: When there's no known bioguide ID on the page, resolve it from something like legislators.csv
(Side note: after lots of testing, the http://www.memberguide.gpoaccess.gov/GetMembersSearch.aspx front page has been showing blank for me, in a browser and in code. A bit later it worked again, but now it's blank again. Perhaps there's some anti-scrape IP filtering. This may or may not be a problem in normal use, but perhaps some [random] delays will help.)
Aweeesssoommme. A few thoughts:
Yes, some rate limiting would probably keep the IP blocking in check. And since the script is running inside the congress-legislators repo, the easiest thing to do is use the utils.load_data method to load in the current and historical legislator data as an array of dicts, rather than resorting to a CSV. You can probably get away with just the current data, if the missing Bioguide IDs are only for newer members.
There are also some utils methods for dealing with command-line args; you can cut out some of the argparse code with them if you like.
You may want to add some caching logic, so that if the page has already been downloaded, it doesn't need to fetch it again. There's a utils.download method that makes that easy.
For now, it should output legislator images to a non-versioned directory -- I can handle making a new repo and moving it there (and migrating some utils methods along with it).
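Independent of the repo's utils.download helper, the caching idea is just "skip the fetch if the file is already on disk"; a bare-bones sketch:

```python
import os

def fetch_with_cache(url, cache_path, fetch):
    """Sketch of the caching idea: `fetch` is whatever function actually
    downloads `url`; the body is only fetched if it isn't on disk yet."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return f.read()
    body = fetch(url)
    with open(cache_path, "w") as f:
        f.write(body)
    return body
```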
I've just added what I'd done before your comment:
This loads the YAML into an array of dicts rather than using the CSV. You're right, it's much easier that way. I added some argparse as well so images are downloaded to a non-versioned directory.
If the Bioguide ID isn't found in the member page, it's resolved against the YAML data.
resolve() resolves in this sort of order:
It didn't resolve Bioguide IDs for four people.
The GPO data should be fixed (how to report?), but should we add a final resolution case for switched names?
These three aren't in the YAML:
Chiesa left in 2013, Radel left in 2014, and Young died in 2013, so they have all been removed from legislators-current.yaml. I've just spotted legislators-historical.yaml. We could use this, but there'll be more risk of matching the wrong person. I suppose some year matching could be implemented, plus reverse sorting the YAML list of dicts.
test_gpo_member_photos.py uses legislators-test.yaml, a subset of legislators-current.yaml, to unit test things like Bioguide ID matching and validation and Bioguide resolution.
Run it like python test_gpo_member_photos.py
TODO: Add caching of downloaded pages, rate limiting. Some other TODOs in the file.
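For the rate-limiting TODO, a small random delay between fetches (as suggested above) is probably enough; something like:

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=3.0):
    # Sleep a random interval between page fetches; the bounds are arbitrary
    # and only meant to avoid tripping any anti-scraping filter.
    time.sleep(random.uniform(min_seconds, max_seconds))
```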
It's okay to add some hard coding for a tiny handful, or for it to miss them but print out a warning that only shows up if we're actually missing the photo. In practice, we'll have most of the photos when it's run, and in version control.
OK, I've added a hardcoding for BB.
I've also added caching of member pages, and a check in case the front page doesn't load, possibly due to rate limiting.
https://github.com/hugovk/congress-legislators/commit/721605b6db3538abb746a254fb97f554ef07586a
Very cool. Want to submit this as a PR to this project, since you've got it in your fork? I can migrate it to a new repo and give you write access from there.
I've submitted PR #167.
If there's any other useful data in the member guide, this code could be easily adapted to scrape it. Python's mechanize and BeautifulSoup are very useful!
Closed by #167.
Use the GPO's Member Guide to fetch images of each member of Congress, and store the results here.
Reference discussion on this approach, and copyright issues: https://github.com/sunlightlabs/congress/issues/432
Original ticket:
Hi,
It's extremely difficult to pull an image from somewhere that is reliable. Additionally, after grabbing the congressman, I would need to make an additional call to search for a profile image based on the representatives returned. It would be great if this was a field in the JSON response, perhaps a URL that we can go out and grab the image from!
Keep up the good work!