Congressional Hearing Parser

unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.

https://github.com/unitedstates/congress/wiki

Creative Commons Zero v1.0 Universal

Congressional Hearing Parser #290

Open connorjoleary opened 2 years ago

connorjoleary commented 2 years ago

Overview

This project gathers transcripts published by the US Government Publishing Office and uses them to work out who said what during congressional hearings. The resulting data can then be used to study the speaking patterns of each member of Congress.

How to use

1. Follow the installation instructions in the main README to install the required Python libraries.
2. Go to https://api.govinfo.gov/docs/ and create an API key.
3. Create a `.env` file in this folder containing the key:
   ```
   GOV_INFO_API_KEY=<gov_info_key>
   ```
4. Run `python congress/contrib/congressional_hearing_info/grab_congressional_hearings.py --num 10`
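For readers who want to see roughly what the key is used for, here is a minimal sketch (not the PR's actual code) of reading `GOV_INFO_API_KEY` from the `.env` file and building a govinfo listing URL for the congressional hearings collection. The function names are hypothetical, and the `CHRG` collection code, the `/collections/...` path, and the `offsetMark`/`pageSize`/`api_key` parameters reflect my reading of the govinfo API docs linked above, so check them there before relying on this.

```python
from pathlib import Path
from urllib.parse import quote, urlencode

GOVINFO_BASE = "https://api.govinfo.gov"


def read_env_key(env_path, name="GOV_INFO_API_KEY"):
    """Return the value of `name` from a simple KEY=VALUE .env file."""
    for raw in Path(env_path).read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip('"').strip("'")
    raise KeyError(f"{name} not found in {env_path}")


def hearings_listing_url(api_key, modified_since, page_size=10):
    """Build a listing URL for hearings modified since an ISO 8601 timestamp.

    Assumption: "CHRG" is the collection code for congressional hearings and
    the endpoint accepts offsetMark, pageSize, and api_key query parameters.
    """
    query = urlencode(
        {"offsetMark": "*", "pageSize": page_size, "api_key": api_key}
    )
    # The timestamp goes in the path, so percent-encode the colons.
    since = quote(modified_since, safe="")
    return f"{GOVINFO_BASE}/collections/CHRG/{since}?{query}"
```

With the key loaded, `hearings_listing_url(read_env_key(".env"), "2023-01-01T00:00:00Z")` gives a URL that any HTTP client can fetch; `page_size` plays a similar role to `--num` in the command above.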

dwillis commented 2 years ago

@connorjoleary thank you for this - it's a really useful area for us to go in. I'd like to hear from @JoshData about it, in particular about having a dependency on ProPublica's API (full disclosure, I currently run that API, but I'm not full-time at ProPublica and I can't guarantee that I'd be able to immediately address errors or downtime in every case). It appears that this PR uses the API to get current members of the House and Senate; I suspect there might be other ways to do that (using the congress-legislators repository, for example).

connorjoleary commented 2 years ago

Oh, a very good point. I'd be happy to swap out ProPublica's API for that. It should be a fairly straightforward change, assuming they both use the same IDs.
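A rough sketch of what that swap could look like, assuming the records have the shape used by the congress-legislators data files (`id.bioguide`, `name.first`/`name.last`/`name.official_full`, and a `terms` list whose entries carry `type` and `state`). The function name and the sample records are made up for illustration; loading the real file (YAML or JSON) is left out.

```python
from collections import defaultdict


def current_members_by_last_name(legislators):
    """Index legislator records by lowercase last name.

    Each value is a list, because two sitting members can share a last name.
    """
    index = defaultdict(list)
    for person in legislators:
        terms = person.get("terms") or []
        if not terms:
            continue
        latest = terms[-1]  # assumption: terms are listed oldest to newest
        name = person["name"]
        index[name["last"].lower()].append(
            {
                "bioguide": person["id"]["bioguide"],
                "full_name": name.get(
                    "official_full", f"{name['first']} {name['last']}"
                ),
                "chamber": "senate" if latest["type"] == "sen" else "house",
                "state": latest["state"],
            }
        )
    return dict(index)


# Made-up records in the assumed shape, for illustration only.
sample = [
    {
        "id": {"bioguide": "X000001"},
        "name": {"first": "Alex", "last": "Example", "official_full": "Alex Example"},
        "terms": [{"type": "rep", "state": "OH"}, {"type": "sen", "state": "OH"}],
    },
    {
        "id": {"bioguide": "X000002"},
        "name": {"first": "Robin", "last": "Example"},
        "terms": [{"type": "rep", "state": "CA"}],
    },
]

members = current_members_by_last_name(sample)
```

Keying on the Bioguide ID in the output should make it easy to check whether the IDs line up with whatever the ProPublica-based code was using.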

DanielSchuman commented 2 years ago

There's also the new official congress.gov API, which should also have a list of current members of the House and Senate. https://www.congress.gov/help/using-data-offsite

JoshData commented 2 years ago

Thanks for sharing this, @connorjoleary.

Yeah it would be nice if all of the data is fetched in a consistent way throughout this repository: legislators from congress-legislators, GPO documents from the fdsys scraper. But I won't block it based on that.

I'd like incoming code to remain maintained by its maintainer for some reasonable period of time and to be documented in a similar way to other tools in this repo (in the main README and the GitHub wiki section). If that's the case, there's no need to put things inside a contrib directory - it can just sit alongside everything else here. (i.e. I would like to avoid this repo becoming a landing place for unmaintained code. That creates a burden for the rest of us.)

connorjoleary commented 2 years ago

Thank you all very much for the comments. I'm happy to maintain this code for a while after it is in place, but I do worry that the quality of this code is not up to snuff. The transcripts do not always follow consistent formatting, names are sometimes misspelled, and attributing who is speaking can be difficult (for example, one hearing had two people with the same last name, distinguished only by gendered title). This means that the output of this hearing parser is not always accurate.
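To make the attribution problem concrete, here is a simplified sketch (not the parser in this PR) that splits transcript lines into speaker turns. It assumes a turn starts with a title, a last name, and a period, as in `Mr. Lee. Thank you...`; the title list and the regex are my own guesses at the common cases. Keying each turn on the (title, last name) pair is one way to keep apart two speakers who share a last name but not a title.

```python
import re

# Assumed turn opener: optional indent, a title, a last name, a period, text.
SPEAKER_RE = re.compile(
    r"^\s*(?P<title>Mr|Mrs|Ms|Dr|Senator|Representative|Chairman|Chairwoman)\.?\s+"
    r"(?P<last>[A-Z][A-Za-z'\-]+)\.\s+(?P<text>.*)$"
)


def split_turns(lines):
    """Group transcript lines into turns keyed by (title, last name)."""
    turns = []
    for line in lines:
        match = SPEAKER_RE.match(line)
        if match:
            turns.append(
                {
                    "speaker": (match["title"], match["last"]),
                    "text": match["text"].strip(),
                }
            )
        elif turns and line.strip():
            # No speaker prefix: treat the line as a continuation.
            turns[-1]["text"] += " " + line.strip()
    return turns


# Made-up excerpt with two speakers who share a last name.
excerpt = [
    "    Mr. Lee. Thank you, Madam Chair.",
    "I have two questions.",
    "    Ms. Lee. I will answer the first.",
    "    Mr. Lee. And the second?",
]
turns = split_turns(excerpt)
```

This does nothing about misspelled names or speakers introduced only by role (for example "The Chairman."), which is where most of the inaccuracy would still come from.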

With that being said, do you all still feel like this code would fit in alongside everything else? Also, if any of you happen to have a way to contact the people who write these transcripts, it would be very helpful if you could ask them to please adopt a consistent format 😆.

connorjoleary commented 1 year ago

Hey all, update on this project. I created a website to easily search and visualize this data. I'll likely continue to make updates to my fork of this repo, as well as use it to collect more text data. Please let me know if you would like this data to be available from this project by approving this PR.

Link to website: congresstext.com

michaelblyons commented 1 year ago

Depending on how far back you want to go, you may be interested in #236.