unitedstates / congress-legislators

Members of the United States Congress, 1789-Present, in YAML/JSON/CSV, as well as committees, presidents, and vice presidents.
Creative Commons Zero v1.0 Universal

Image for congressman #160

Closed pierceboggan closed 10 years ago

pierceboggan commented 10 years ago

Use the GPO's Member Guide to fetch images of each member of Congress, and store the results here.

Reference discussion on this approach, and copyright issues: https://github.com/sunlightlabs/congress/issues/432

Original ticket:

Hi,

It's extremely difficult to pull an image from somewhere that is reliable. Additionally, after grabbing the congressman, I would need to make an additional call to search for a profile image based on the representatives returned. It would be great if this was a field in the JSON response, perhaps a URL that we can go out and grab the image from!

Keep up the good work!

wilson428 commented 10 years ago

You can generally get this for current or recent members if you know the bioguide ID. If you look on the bioguide page for Mo Cowan, for instance, you'll see his photo at this URL:

http://bioguide.congress.gov/bioguide/photo/C/C001099.jpg

In whatever language you use, you then can construct the URL from something like:

"http://bioguide.congress.gov/bioguide/photo/" + bioguide[0] + "/" + bioguide + ".jpg"

There is not 100% coverage for current members, but it's a good start.
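In Python, for instance, that pattern is just the following (bioguide_photo_url is a made-up helper name, only illustrating the construction above):

def bioguide_photo_url(bioguide_id):
    # e.g. "C001099" -> ".../photo/C/C001099.jpg"
    return ("http://bioguide.congress.gov/bioguide/photo/"
            + bioguide_id[0] + "/" + bioguide_id + ".jpg")

print(bioguide_photo_url("C001099"))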

konklone commented 10 years ago

Sunlight also offers a set of MoC photos, named by Bioguide ID, for download as a zip file. We normalize them into a bunch of different sizes, with the largest being 250x200.

Even though this project doesn't actually host the MoC photos, the little shell script Sunlight uses to do the resize work is in this repo, and you could adapt it to your needs. Either way, you'd want to put the result into S3 or something.
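(Sunlight's version is a shell script; purely as an illustration, a rough Python/Pillow equivalent producing the sizes mentioned in this thread would look something like the following. The directory names are made up.)

import glob
import os

from PIL import Image

SIZES = [(200, 250), (100, 125), (40, 50)]

for width, height in SIZES:
    out_dir = "%dx%d" % (width, height)
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for path in glob.glob("originals/*.jpg"):
        img = Image.open(path)
        img.thumbnail((width, height))  # shrinks in place, keeping aspect ratio
        img.save(os.path.join(out_dir, os.path.basename(path)))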

konklone commented 10 years ago

Also, we get those photos from the Congressional Pictorial Directory, published by GPO, so they may not be the same as the ones on bioguide.congress.gov.

JoshData commented 10 years ago

Also bulk data from GovTrack: https://www.govtrack.us/developers/data

I could probably add a has_photo field to the GovTrack API....

pierceboggan commented 10 years ago

Awesome. Thanks guys!

konklone commented 10 years ago

We had a terrific thread over at https://github.com/sunlightlabs/congress/issues/432#issuecomment-34554026 on this, and (after @mwweinberg picked up the phone and called the GPO), the resolution was to make a scraper for the GPO's Member Guide, and then offer the photos for download.

I'm updating this ticket's description to reflect this. Does anyone have any objection to adding the images to this repository, or should they go somewhere else? I think it's convenient to have them here, since it's in scope for the repo.

For reference, 812 200x250 JPGs of members currently take up ~23MB. The 100x125 versions take up 4.8MB, and the 40x50 versions 3.6MB. All 3 folders, plus .zip files for each one, take up 56MB in total.

The versions on the Member Guide are 231x281, which we could keep as originals (increasing disk use), or just continue making 200x250 versions and ditch the originals.

/cc @JoshData @sbma44

konklone commented 10 years ago

One additional idea: we could potentially store the photos in a gh-pages branch, and produce permalinks to each photo.

JoshData commented 10 years ago

There are high-res images over there, I think.

I don't know whether we really need to store them in a repo (vs just a scraper), but if we do I'd strongly prefer a separate repo for it.

konklone commented 10 years ago

I think it might be a neat, low-maintenance thing to version the images. But a separate repo is fine, and makes experimentation easier.

Do you know how to get the high-res images?

JoshData commented 10 years ago

They were hi-res in the Wikipedia links, and the DOM inspector seemed to indicate they were bigger than on the site as displayed, but I didn't get ANY image when hitting the image URL, so that's as far as I got.

konklone commented 10 years ago

OK, so I did:

wget "http://memberguide.gpo.gov/ReadLibraryItem.ashx?SFN=iqw/hTCdweheEMFH1iwn0bt5yckfRo6E2eA2JdiV4F5SafjBF0U12w==&I=1MKI2SYWd4A="` 

and that got me a 589x719 version of Vance McAllister, 86.6K in size.

hugovk commented 10 years ago

A quick search to see if anyone has written a scraper for this before found nothing, but it did turn up this PDF of "Grading the Government’s Data Publication Practices" by Jim Harper, which you may find interesting (although it may already be familiar to you).

"As noted above, the other ways of learning about House and Senate membership are ad hoc. The Government Printing Office has a “Guide to House and Senate Members” at http://memberguide.gpo.gov/ that duplicates information found elsewhere. The House website presents a list of members along with district information, party affiliation, and so on, in HTML format (http://www.house.gov/representatives/), and beta.congress.gov does as well (http://beta.congress.gov/members/). Someone who wants a complete dataset must collect data from these sources using a computer program to scrape the data and through manual curation. The HTML presentations do not break out key information in ways useful for computers. The Senate membership page, on the other hand, includes a link to an XML representation that is machine readable. That is the reason why the Senate scores so well compared to the House."

http://www.cato.org/pubs/pas/PA711.pdf

http://beta.congress.gov/members is nicely scrapable (I wonder if they have an API), but then some images are missing, and we are back wondering about the copyright.

The mobile memberguide site is very scrapable, but the images are hosted on m.gpo.gov and are only a lo-res image and a lower-res thumbnail. http://m.gpo.gov/memberguide/

But if wget works on memberguide.gpo.gov then that is a good start. As it happens, Wikipedia uses the same image of Vance McAllister, and their original file is also 589 × 719. https://en.wikipedia.org/wiki/File:Vance_McAllister.jpg

JoshData commented 10 years ago

"(I wonder if they have an API)"

Welcome to the world of legislative data. A fantastic and frustrating world awaits. :)

Thanks for doing the research on getting the images, btw.

konklone commented 10 years ago

That is solid research. I think a new scraper for the normal (non-mobile) member guide is what's called for, to get the maximum size and the greatest guarantee of public domain.

hugovk commented 10 years ago

The normal member guide is defaulting to the 112th Congress. It can be downloaded with wget:

wget "http://memberguide.gpo.gov/GetMembersSearch.aspx"

I think we should be able to get the table with POST commands (see the sketch after the field list below). For example, to select the 113th Congress with wget:

wget "http://memberguide.gpo.gov/GetMembersSearch.aspx" --post-data 'ctl00$ContentPlaceHolder1$ddlCongressSession=113'

But I've not got it working yet. Anyway, that page has links to each member's own page, with the photo.

Some other options:

ctl00$ContentPlaceHolder1$ddlMemberType:Select Member Type
ctl00$ContentPlaceHolder1$ddlParty:All
ctl00$ContentPlaceHolder1$ddlCongressSession:113
ctl00$ContentPlaceHolder1$ddlSearchIn:Select Category
ctl00$ContentPlaceHolder1$ddlSearchCriteria:Like
ctl00$ContentPlaceHolder1$txtSearch:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl00$GoToPageTextBox:1
ctl00_ContentPlaceHolder1_Memberstelerikrid_ctl00_ctl02_ctl00_GoToPageTextBox_ClientState:{"enabled":true,"emptyMessage":"","validationText":"1","valueAsString":"1","minValue":1,"maxValue":12,"lastSetTextBoxValue":"1"}
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl00$ChangePageSizeTextBox:50
ctl00_ContentPlaceHolder1_Memberstelerikrid_ctl00_ctl02_ctl00_ChangePageSizeTextBox_ClientState:{"enabled":true,"emptyMessage":"","validationText":"50","valueAsString":"50","minValue":1,"maxValue":554,"lastSetTextBoxValue":"50"}
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnLastName:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnStateDescription:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnMemberType:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnPartyDescription:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnHometown:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnDistrict:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl02$ctl02$FilterTextBox_TemplateColumnTermCount:
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl03$ctl01$GoToPageTextBox:1
ctl00_ContentPlaceHolder1_Memberstelerikrid_ctl00_ctl03_ctl01_GoToPageTextBox_ClientState:{"enabled":true,"emptyMessage":"","validationText":"1","valueAsString":"1","minValue":1,"maxValue":12,"lastSetTextBoxValue":"1"}
ctl00$ContentPlaceHolder1$Memberstelerikrid$ctl00$ctl03$ctl01$ChangePageSizeTextBox:50
ctl00_ContentPlaceHolder1_Memberstelerikrid_ctl00_ctl03_ctl01_ChangePageSizeTextBox_ClientState:{"enabled":true,"emptyMessage":"","validationText":"50","valueAsString":"50","minValue":1,"maxValue":554,"lastSetTextBoxValue":"50"}
ctl00_ContentPlaceHolder1_Memberstelerikrid_rfltMenu_ClientState:
ctl00_ContentPlaceHolder1_Memberstelerikrid_ClientState:
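One possible reason the bare --post-data call fails: ASP.NET WebForms pages generally refuse POSTs that don't echo back their hidden __VIEWSTATE / __EVENTVALIDATION fields. A rough, untested sketch of what might work in Python with requests and BeautifulSoup (the server may also want __EVENTTARGET set for the dropdown postback):

import requests
from bs4 import BeautifulSoup

URL = "http://memberguide.gpo.gov/GetMembersSearch.aspx"
session = requests.Session()

# Fetch the page once to pick up the hidden ASP.NET state fields...
soup = BeautifulSoup(session.get(URL).text, "html.parser")
data = {}
for tag in soup.find_all("input", {"type": "hidden"}):
    if tag.get("name"):
        data[tag["name"]] = tag.get("value", "")

# ...then echo them back along with the Congress dropdown selection.
data["ctl00$ContentPlaceHolder1$ddlCongressSession"] = "113"
result = BeautifulSoup(session.post(URL, data=data).text, "html.parser")

# The results table links to each member's own page, which carries the photo.
for link in result.find_all("a", href=True):
    print(link["href"])
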
konklone commented 10 years ago

I think a wget of the whole site will not be enough, because we need to be able to associate Bioguide IDs with the downloaded pictures (and name the files after them). So, the scraper should run over the member guide, use the congress-legislators data to resolve the names, and name the downloaded images after their Bioguide IDs.

hugovk commented 10 years ago

The couple of member pages I checked handily have a link to a bio page, which contains the Bioguide ID in the URL, for example:

<a id="ctl00_ContentPlaceHolder1_lnkBioguide" title="Click to View Biographical Informaion" class="iconbio" href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000022/" target="_blank">Biographical Information</a>

konklone commented 10 years ago

Hmm, I see that, though when I checked a recent member's page for Vance McAllister, it didn't have one. So I think it's probably better to resolve using congress-legislators data, using the last name, state, and chamber for disambiguation where needed. When matching against legislators who served a term in a particular Congress only, that should be pretty doable.
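Roughly what I have in mind, as a sketch against the YAML schema this repo already uses (name/last, terms/state, terms/type, id/bioguide) -- not working code:

import yaml

with open("legislators-current.yaml") as f:
    legislators = yaml.safe_load(f)

# Index current members by (last name, state); keep the chamber around
# for disambiguation when two members from a state share a last name.
lookup = {}
for leg in legislators:
    term = leg["terms"][-1]
    key = (leg["name"]["last"].lower(), term["state"])
    lookup.setdefault(key, []).append(leg)

def resolve(last_name, state, chamber=None):
    candidates = lookup.get((last_name.lower(), state), [])
    if chamber and len(candidates) > 1:
        wanted = "sen" if chamber.lower().startswith("sen") else "rep"
        candidates = [l for l in candidates if l["terms"][-1]["type"] == wanted]
    if len(candidates) == 1:
        return candidates[0]["id"]["bioguide"]
    return None  # ambiguous or not found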

JoshData commented 10 years ago

It's probably just because McAllister is new. He also has no photo. Don't throw the baby (bioguide IDs) out with the bathwater (McAllister)!

sbma44 commented 10 years ago

There's some strange monkey business in their code, I think. For a while I wasn't getting reliable images even when using the same URL -- sometimes it produced photos of different legislators than the one I thought I was selecting, sometimes different resolutions, sometimes placeholder images. I suspect they're doing something stupid with session variables. This is pretty easy to verify, given that a bare curl of the image src generally doesn't return the right photo.

FWIW a working curl invocation (taken from chrome's network tab) is below, and there isn't too much to it. My testing makes me think referer is probably irrelevant but I'm not 100% sure. I suspect you are going to have to establish the session cookie, though, and perhaps grab each legislator page's HTML before attempting to grab the image. I could be wrong about this, but something weird seems to be going on.

curl 'http://memberguide.gpo.gov/ReadLibraryItem.ashx?SFN=EwZsCCzU55gKCia/s41Zv1J0gtcl+Ev+&I=1MKI2SYWd4A=' -H 'DNT: 1' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36' -H 'Accept: image/webp,*/*;q=0.8' -H 'Referer: http://memberguide.gpo.gov/112/SR/Alexander' -H 'Cookie: AspxAutoDetectCookieSupport=1; ASP.NET_SessionId=gxscujd2iai1z1nheatk0gy3' -H 'Connection: keep-alive' -H 'Cache-Control: max-age=0' --compressed

konklone commented 10 years ago

Hmm, @JoshData, McAllister had a photo yesterday; his was the example I used. And I strongly suspect it has to do with what @sbma44 is isolating.

But I wonder if it's time-based rather than session-based? Because yesterday, I ran:

wget "http://memberguide.gpo.gov/ReadLibraryItem.ashx?SFN=iqw/hTCdweheEMFH1iwn0bt5yckfRo6E2eA2JdiV4F5SafjBF0U12w==&I=1MKI2SYWd4A="

And it gave me a photo for McAllister. Now, that exact command downloads a photo that says "No photo". I don't see why wget would have had access to my browser's session or anything yesterday.

This is malarkey!

hugovk commented 10 years ago

I ran the exact same command yesterday and got the photo. Today, no photo.

They're using cookies and sessions. Must be something in that URL. See also the ASP.NET_SessionId in the curl command.

I think we need to go in through the front door and proceed from there.

konklone commented 10 years ago

Yeah, but a straight wget without any known sessions or cookies also downloads the same file @sbma44 got with curl:

wget "http://memberguide.gpo.gov/ReadLibraryItem.ashx?SFN=EwZsCCzU55gKCia/s41Zv1J0gtcl+Ev+&I=1MKI2SYWd4A="

and now the McAllister image works with wget too. 15 minutes ago, it didn't -- and the Member Guide was showing the same No Photo image. Now McAllister's page is back to normal and shows his photo.

Now I'm wondering if you can reliably wget any photo from these pages, as long as you've "warmed" them somehow, maybe by fetching the page with a browser or curl with the options @sbma44 used, and waiting a minute.
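If that's right, a requests.Session that visits the member page before asking for the image (the URLs below are the ones from this thread) might be all it takes -- but this is speculation until someone tests it:

import requests

session = requests.Session()
member_page = "http://memberguide.gpo.gov/112/SR/Alexander"
image_url = ("http://memberguide.gpo.gov/ReadLibraryItem.ashx"
             "?SFN=EwZsCCzU55gKCia/s41Zv1J0gtcl+Ev+&I=1MKI2SYWd4A=")

# Hit the member page first so the server sets its ASP.NET session cookie,
# then request the image with the same ("warmed") session.
session.get(member_page)
resp = session.get(image_url)
if resp.headers.get("Content-Type", "").startswith("image"):
    with open("alexander.jpg", "wb") as f:
        f.write(resp.content)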

hugovk commented 10 years ago

Maybe just a wget or curl of the front page will do it...

konklone commented 10 years ago

By all means, see if you can get that working -- if it works, that'd be the easiest method.

JoshData commented 10 years ago

Wow. This is one of the wackiest web scraping situations I've seen.

There's going to be an opportunity to bug GPO about things in a few weeks. One of us can bring it up, or @hugovk if you're in the DC area we can let you know how to come bug GPO in person.

hugovk commented 10 years ago

@JoshData Thanks, but I'm in the Helsinki area :)

I've discovered Python's mechanize package, which is able to get the form, change values and resubmit. It can find and open each of the member links. With the help of BeautifulSoup we can find the img tags on each page and download them.
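Roughly, the flow is something like this (a simplified sketch, not the full script linked below; the control name comes from the form dump earlier in this thread and the form index is a guess):

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://memberguide.gpo.gov/GetMembersSearch.aspx")

# Pick the search form and switch the Congress dropdown to the 113th.
br.select_form(nr=0)
br["ctl00$ContentPlaceHolder1$ddlCongressSession"] = ["113"]
response = br.submit()

# Parse the results and inspect the member links (each member page has the img).
soup = BeautifulSoup(response.read(), "html.parser")
for link in soup.find_all("a", href=True):
    print(link["href"])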

These were useful: https://blog.scraperwiki.com/2011/11/how-to-get-along-with-an-asp-webpage/ https://views.scraperwiki.com/run/python_mechanize_cheat_sheet/

I've made a first proto version here: https://github.com/hugovk/congress-legislators/blob/master/scripts/gpo_member_photos.py

Still to do: When there's no known bioguide ID on the page, resolve it from something like legislators.csv

(Side note: after lots of testing, the http://www.memberguide.gpoaccess.gov/GetMembersSearch.aspx form page has been showing blank for me, in a browser and in code. A bit later it worked again, but now it's blank again. Perhaps there's some anti-scrape IP filtering. This may or may not be a problem in normal use, but perhaps some [random] delays will help.)

konklone commented 10 years ago

Aweeesssoommme. A few thoughts:

Yes, some rate limiting would probably keep the IP blocking in check. And since the script is running inside the congress-legislators repo, the easiest thing to do is use the utils.load_data method to load in the current and historical legislator data as an array of dicts, rather than resorting to a CSV. You can probably get away with just the current data, if the missing bioguide IDs are only for newer members.

There are also some utils methods for dealing with command line args, you can cut out some of the argparse code with it if you like.

You may want to add some caching logic, so that if the page has already been downloaded, it doesn't need to fetch it again. There's a utils.download method that makes that easy.

For now, it should output legislator images to a non-versioned directory -- I can handle making a new repo and moving it there (and migrating some utils methods along with it).
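For reference, the usage I'm picturing is roughly this -- I'm assuming load_data takes the YAML filename and download takes a URL plus a destination to cache to; check scripts/utils.py for the exact signatures:

import utils  # scripts/utils.py in this repo

# Assumed call shapes -- verify against utils.py before relying on them.
legislators = utils.load_data("legislators-current.yaml")
print(len(legislators))

page_html = utils.download(
    "http://memberguide.gpo.gov/GetMembersSearch.aspx",
    "gpo_member_photos/GetMembersSearch.aspx.html")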

hugovk commented 10 years ago

I've just added what I'd done before your comment:

This loads the YAML into an array of dicts rather than using the CSV. You're right, it's much easier that way. I added some argparse as well so images are downloaded to a non-versioned directory.

If the Bioguide ID isn't found in the member page, it's resolved against the YAML data.

resolve() resolves in this sort of order:

It didn't resolve Bioguide IDs for four people.

The GPO data should be fixed (how to report?), but should we add a final resolution case for switched names?

These three aren't in the YAML:

Chiesa left in 2013, Radel left in 2014, and Young died in 2013, so they have all been removed from legislators-current.yaml. I've just spotted legislators-historical.yaml. We could use this, but there'll be more risk of matching the wrong person. I suppose some year matching could be implemented, plus reverse-sorting the YAML list of dicts.

test_gpo_member_photos.py uses legislators-test.yaml, a subset of legislators-current.yaml, to unit test things like Bioguide ID matching and validation and Bioguide resolution.

Run it like python test_gpo_member_photos.py

TODO: Add caching of downloaded pages, rate limiting. Some other TODOs in the file.

konklone commented 10 years ago

It's okay to add some hard coding for a tiny handful, or for it to miss them but print out a warning that only shows up if we're actually missing the photo. In practice, we'll have most of the photos when it's run, and in version control.

hugovk commented 10 years ago

OK, I've added a hardcoding for BB.

I've also added caching of member pages, and a check in case the front page doesn't load, possibly due to rate limiting.

https://github.com/hugovk/congress-legislators/commit/721605b6db3538abb746a254fb97f554ef07586a

konklone commented 10 years ago

Very cool. Want to submit this as a PR to this project, since you've got it in your fork? I can migrate it to a new repo and give you write access from there.

hugovk commented 10 years ago

I've submitted PR #167.

If there's any other useful data in the member guide, this code could be easily adapted to scrape it. Python's mechanize and BeautifulSoup are very useful!

konklone commented 10 years ago

Closed by #167.