unitedstates / images

Public domain photos of Members of the United States Congress
https://theunitedstates.io/images/
Creative Commons Zero v1.0 Universal

Empty list when running `gpo_member_photos.py` script #221

Open mdrayer opened 1 month ago

mdrayer commented 1 month ago

It seems that the scraper is no longer working. I get a "Just a moment..." HTML page for the response (https://github.com/unitedstates/images/blob/gh-pages/scripts/gpo_member_photos.py#L54). This can be replicated by doing a curl on the URL the scraper is trying to hit: https://www.congress.gov/search?q=%7B%22source%22%3A+%22members%22%2C+%22congress%22%3A+%22118%22%7D&pageSize=250&page=1.

Is there a way to make the scraper wait for the page and its form to finish loading before it parses the HTML?
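To make the failure easier to spot, here is a minimal sketch (not the code in `gpo_member_photos.py`) that rebuilds the search URL quoted above and checks whether a response body is the "Just a moment..." page rather than real results. The function names are hypothetical, and treating that title as the marker of an interstitial page is an assumption based only on the response described here.

```python
import json
import re
from urllib.parse import urlencode

SEARCH_URL = "https://www.congress.gov/search"

# Matches the contents of the first <title> element in an HTML document.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def build_members_url(congress: int, page: int = 1, page_size: int = 250) -> str:
    """Rebuild the member search URL quoted in this issue (hypothetical helper)."""
    # The `q` parameter is a JSON object; urlencode percent-encodes it.
    query = json.dumps({"source": "members", "congress": str(congress)})
    params = {"q": query, "pageSize": page_size, "page": page}
    return SEARCH_URL + "?" + urlencode(params)


def looks_like_interstitial(html: str) -> bool:
    """Return True if the page title starts with "Just a moment".

    Assumption: that title identifies the waiting page described above,
    not a real search results page.
    """
    match = TITLE_RE.search(html)
    return bool(match) and match.group(1).strip().startswith("Just a moment")


if __name__ == "__main__":
    print(build_members_url(118))
    sample = "<html><head><title>Just a moment...</title></head></html>"
    print(looks_like_interstitial(sample))
```

A scraper could call `looks_like_interstitial` on each response and raise a clear error instead of silently returning an empty list. Note that waiting longer in a plain HTTP client is unlikely to help if the page relies on JavaScript running in a browser.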

hugovk commented 1 month ago

Yeah, I think it's been a few years since the scraper script was last run to fetch images, so it's unsurprising that it's not working properly. Scrapers need quite a bit of upkeep!

Unfortunately I don't really have time to maintain this repo any more, so we should find some new maintainers willing to volunteer.

mdrayer commented 1 month ago

GPO.gov does have an API that includes these images we are attempting to scrape: https://pictorialapi.gpo.gov/. There's an endpoint to get all members of a particular session, e.g. 118th Congress: https://pictorialapi.gpo.gov/api/GuideMember/GetMembers/118 which yields images like https://memberguide.gpo.gov/pictorialimages/118_SR_NJ_Booker_Cory.jpg. However, it does not include the member's bioguide ID in the data, so we'd need to do some mapping with another dataset to get the bioguide ID.
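As a rough sketch of what that mapping could start from, the snippet below splits an image URL like the one above into its parts. The field layout (congress, a position code, state, last name, first name, separated by underscores) is an assumption inferred from that single example filename, and `PictorialImage` and `parse_image_url` are hypothetical names, not part of the GPO API or this repo.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import urlparse


@dataclass(frozen=True)
class PictorialImage:
    congress: int
    position_code: str  # e.g. "SR"; its meaning is not confirmed here
    state: str
    last_name: str
    first_name: str


def parse_image_url(url: str) -> PictorialImage:
    """Split a pictorial image URL into fields (assumed filename layout)."""
    # Take the filename without its extension, e.g. "118_SR_NJ_Booker_Cory".
    stem = PurePosixPath(urlparse(url).path).stem
    # Split into at most five parts; any extra underscores stay in first_name.
    parts = stem.split("_", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected filename layout: {stem!r}")
    congress, position_code, state, last_name, first_name = parts
    return PictorialImage(int(congress), position_code, state, last_name, first_name)


if __name__ == "__main__":
    url = "https://memberguide.gpo.gov/pictorialimages/118_SR_NJ_Booker_Cory.jpg"
    print(parse_image_url(url))
```

Matching on state plus name against another dataset would still need manual checks for nicknames, suffixes, and multi-word surnames, so a shared ID would be more reliable than this kind of name matching.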

kevinschaul commented 1 week ago

FYI I'm working to get pictorial ids into the legislators project: https://github.com/unitedstates/congress-legislators/pull/943