Open eheitfield opened 6 years ago
Hi @eheitfield! Very nice work, and thanks for offering to integrate this data! Are you using it anywhere yourself?
Would you mind taking a look over the current bioguide parsing script we have at https://github.com/unitedstates/congress-legislators/blob/master/scripts/bioguide.py ? We use it to fill in some basic fields in the YAML files there, and it also has some special case logic in there for the times when the Bioguide has shown inconsistencies.
I don't think @JoshData or @dwillis or I have found a lot of use for the raw biographical text itself, though I don't want to speak for them, and we've definitely parsed data out of those text blocks. I'd be interested in their thoughts on the value.
Hi @konklone
Thanks for your quick response. In answer to your question, I plan to use the full-text biography information in my Our Congress iOS app to allow users to more easily learn about the backgrounds of sitting members of Congress.
I've taken a look at the script you mentioned. It's quite impressive and obviously a lot more extensive than the simple one I use for extracting the full-text information. With the caveat that I haven't worked with YAML and am fairly new to page scraping in general, I do have a few thoughts about how the script might be extended:
It seems like it would be trivial to save the full-text biography information as an additional item in the YAML database under a field like bio:summary.
More ambitious, but I think also doable, would be to try to extract information about each member's profession prior to joining Congress. Just from scanning a sample of bioguide full-text data for sitting members, it looks like they use pretty standard descriptions: "lawyer, private practice", "engineer", etc. I could imagine writing a script that tabulates word/phrase counts for a large number of bioguide records, looking at that list to identify words/phrases that clearly describe professions, and then doing a second pass through the bioguide to match profession descriptions with members.
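A rough sketch of that two-pass idea, in Python to match the existing bioguide.py script. Everything here is an assumption for illustration: the function names are hypothetical, the sample biographies and member IDs are made up, and splitting on semicolons and commas is just one guess at how the phrases are delimited (it would break "lawyer, private practice" into two phrases).

```python
import re
from collections import Counter


def split_phrases(bio_text):
    """Split a biography into lowercase phrases at ';' and ',' (assumed delimiters)."""
    phrases = []
    for part in re.split(r"[;,]", bio_text.lower()):
        # Collapse runs of whitespace and drop surrounding spaces and periods.
        cleaned = " ".join(part.split()).strip(" .")
        if cleaned:
            phrases.append(cleaned)
    return phrases


def count_phrases(bios):
    """First pass: count how many members' biographies contain each phrase."""
    counts = Counter()
    for text in bios.values():
        # Use a set so a phrase counts once per member, not once per occurrence.
        counts.update(set(split_phrases(text)))
    return counts


def match_professions(bios, professions):
    """Second pass: list the hand-picked profession phrases found for each member."""
    wanted = set(professions)
    return {
        member_id: [p for p in split_phrases(text) if p in wanted]
        for member_id, text in bios.items()
    }


# Made-up sample records keyed by made-up IDs, for illustration only.
sample_bios = {
    "X000001": "a Representative from Ohio; born in Dayton; lawyer, private practice; elected to Congress.",
    "X000002": "a Senator from Utah; engineer; lawyer; elected to the Senate.",
}

if __name__ == "__main__":
    # Review the most common phrases by hand, then pick out the professions.
    for phrase, n in count_phrases(sample_bios).most_common(5):
        print(n, phrase)
    print(match_professions(sample_bios, ["lawyer", "engineer"]))
```

The manual review step between the two passes is deliberate: the counts only surface candidate phrases, and a person still decides which ones actually describe professions.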
Yeah I mean it's interesting stuff. I first parsed it about a decade ago (and that's how the congress-legislators database came to be originally). But I've never had any reason to display the full text blob - the biographical information (beyond what we already have in congress-legislators) isn't something GovTrack users have ever expressed an interest in, or at least they haven't found it difficult to find since we link out to bioguide.congress.gov on each legislator page anyway.
The Biographical Directory of the U.S. Congress includes text paragraphs summarizing the career of each (current and former) member of Congress. The information is quite easy to scrape because GET requests for member biography pages follow a standard URL scheme and the pages themselves are standardized and well formatted. I've created a repository of JSON-formatted biographical information for sitting members here. As documented here, I've set the repository up in a manner similar to the member photos repository; users can retrieve individual records using a standard file name convention. Would people find it useful to integrate this information into the broader Congress project? If so, I'd be happy to help out.
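To illustrate what retrieval by file name convention could look like, here is a minimal Python sketch. It assumes records are named by Bioguide ID with a .json extension, which is a guess and not something stated above. The base URL is left as a parameter because the real repository location is not given here, and record_url and fetch_record are hypothetical helper names.

```python
import json
import re
from urllib.request import urlopen

# Bioguide IDs are one capital letter followed by six digits, e.g. "B000944".
BIOGUIDE_ID = re.compile(r"^[A-Z]\d{6}$")


def record_url(base_url, bioguide_id):
    """Build the URL of one member's record (assumed layout: <base>/<id>.json)."""
    if not BIOGUIDE_ID.match(bioguide_id):
        raise ValueError(f"not a Bioguide ID: {bioguide_id!r}")
    return f"{base_url.rstrip('/')}/{bioguide_id}.json"


def fetch_record(base_url, bioguide_id):
    """Download and parse one record; the caller supplies the real base URL."""
    with urlopen(record_url(base_url, bioguide_id)) as response:
        return json.load(response)


if __name__ == "__main__":
    # example.org is a placeholder, not the actual repository location.
    print(record_url("https://example.org/bios/", "B000944"))
```

Validating the ID before building the URL keeps malformed input from turning into a confusing 404.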