Closed smplater closed 3 years ago
@smplater Could I get examples of Chamber, Committee, and Event Type?
Guessing event type could be:
- House of Representatives
- Senate
- OPM

These are the only datasets I was able to download so far. Do we have more datasets? There are a few event attributes we may be able to use, such as measure and business, but they seem to be optional and not always present.
Committee and Chamber: I'm not sure how I can classify events by committee with the current dataset, so this one we may not be able to support?
Chamber: is this House of Representatives vs. Senate? I wasn't sure what this was.
I'm trying to normalize 3 different event datasets into our DB. There's not much info we can use besides the pure event data. Additional processing may need to be done, or we may need more datasets.
Ari,
I may be able to help.
Senate committee proceedings are all available at:
Because they come from this source, you can automatically categorize all the data as Senate committee proceedings.
House committee proceedings are all available at
<a class="downloadXML" href="Download.aspx?file=/billsthisweek/20210517/20210517.xml" xmlns:dt="http://xsltsl.org/date-time"> XML
See specifically:
Bills This Week
An XML file for each week is available for the “Bills to be considered on the House Floor” section of docs.house.gov. This XML is well-formed. The elements and attributes are self-describing.
Committee Repository
An XML file for each meeting is available in the “committee repository” section of docs.house.gov. The XML is well-formed. The elements and attributes are self-describing.
The
Data that comes from this source can be automatically categorized as House proceedings, either committee or floor actions, depending on which section you get it from.
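A minimal sketch of that source-based categorization, assuming the chamber can be inferred from the host name and the House section from the URL path. The `classify_source` helper, the `SOURCE_CHAMBER` mapping, and the path checks are hypothetical names and rules for illustration, not something the datasets define.

```python
from urllib.parse import urlparse

# Hypothetical mapping from source host to chamber, based on the hosts
# mentioned in this thread.
SOURCE_CHAMBER = {
    "www.senate.gov": "Senate",
    "docs.house.gov": "House",
}


def classify_source(url: str) -> dict:
    """Guess chamber and proceeding kind purely from where the data came from."""
    parsed = urlparse(url)
    chamber = SOURCE_CHAMBER.get(parsed.netloc.lower(), "Unknown")
    # Look at both path and query string, since some links pass the file as a parameter.
    location = (parsed.path + "?" + parsed.query).lower()

    if chamber == "Senate":
        kind = "committee"
    elif chamber == "House" and "/committee/" in location:
        kind = "committee"
    elif chamber == "House" and "/billsthisweek/" in location:
        kind = "floor"
    else:
        kind = "unknown"
    return {"chamber": chamber, "kind": kind}
```

Anything that does not match one of the known sources falls through to "Unknown" / "unknown" rather than being guessed at.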
IF, HOWEVER, the data is being pulled from the Congress.gov committee schedule
The website that people can read is https://www.congress.gov/committee-schedule/weekly/2021/05/17?searchResultViewType=expanded There is no obvious way to download that content as structured data, as far as I know, but I have not played with it.
Thank you Daniel. I was looking at this.
1. For the Senate hearings, I think we are good because I could get the data from the mentioned URL: https://www.senate.gov/general/committee_schedules/hearings.xml
2. I think they're all hearings, though. I'm not sure yet where I could find Senate committee markups.
3. Regarding the house data, there's no easy resource to consume calendar data, it seems. The data is on ASP/HTML pages, but HTML is not necessarily well-defined, so we could have a lot of parsing issues. Fortunately, it looks like calendar data is not available for June, so maybe we only have to scrape a little bit of data at a time.
4. As a workaround, I thought about consuming the latest (past) committee meetings (XML), and parse the future committee meetings from the website. Today, the latest (past) committee meetings are not showing up, which is odd because it was listing previous meetings and not just today's. I'm thinking of scraping the monthly view, but this gets a bit hairy so I'll work on some of the other issues first
Thank you, David. I might be able to clear up a few more things.
There is nothing in the XML to distinguish between hearings and business meetings, so just treat them all as committee proceedings. So, looking at the Senate XML page that you linked to above...
This is a hearing
This is a markup, aka a business meeting
This is another business meeting, but instead of looking at legislation, it looks at a nomination.
It's important to keep in mind that this needs to be checked constantly, at least once a day (maybe twice). Notice of committee hearings is supposed to be given 7 days in advance and notice of committee markups (business meetings) is supposed to be given 3 days in advance, but this is not always followed. For example, you probably will find very little calendar data for June.
There is XML for the pages, but the info is at the page level.
Here is a particular calendar item, a markup of H.R. 1629
This is the landing page: https://docs.house.gov/Committee/Calendar/ByEvent.aspx?EventID=112659
Notice the URL increments in the event ID, which suggests one way to get each item is to increment the ID.
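A rough sketch of the ID-increment idea, assuming event IDs are plain integers passed in the `EventID` query parameter as in the URL above. The `event_urls` helper is a hypothetical name, and nothing here guarantees that every ID in a range corresponds to a real meeting, so a real crawler would still need to handle missing pages.

```python
from typing import List
from urllib.parse import urlencode

# Landing-page URL taken from the example above.
BASE_URL = "https://docs.house.gov/Committee/Calendar/ByEvent.aspx"


def event_urls(start_id: int, count: int) -> List[str]:
    """Build landing-page URLs for `count` consecutive event IDs starting at `start_id`."""
    return [
        f"{BASE_URL}?{urlencode({'EventID': event_id})}"
        for event_id in range(start_id, start_id + count)
    ]


if __name__ == "__main__":
    for url in event_urls(112659, 3):
        print(url)
```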
Regardless, at the landing page, there is a button to download the meeting's XML file.
It creates a local temp file. Here's mine: file:///C:/Users/DANIEL~1/AppData/Local/Temp/HMTG-117-RU00-20210517.xml
The temp file appears to be well-formed XML. It contains, among other things:
<start-time>15:00:00</start-time>
Note that meeting information can be updated and meetings can be postponed.
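As an illustration of reading that file, here is a small sketch using only the standard library. The only element name taken from the file is `start-time`; the `<meeting>` wrapper in the sample is a placeholder, and the reading of the file name as prefix, Congress number, committee code, and date is an assumption based on the single example above.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# Placeholder document: only <start-time> is known from the real file.
SAMPLE_XML = "<meeting><start-time>15:00:00</start-time></meeting>"

# Assumed file name layout, e.g. HMTG-117-RU00-20210517.xml
FILENAME_RE = re.compile(
    r"^(?P<prefix>[A-Z]+)-(?P<congress>\d+)-(?P<committee_code>[A-Z0-9]+)-(?P<date>\d{8})\.xml$"
)


def parse_start_time(xml_text: str):
    """Return the meeting start time as a datetime.time, or None if absent."""
    root = ET.fromstring(xml_text)
    node = next(root.iter("start-time"), None)
    if node is None or not (node.text or "").strip():
        return None
    return datetime.strptime(node.text.strip(), "%H:%M:%S").time()


def parse_filename(name: str):
    """Split a meeting file name into its assumed parts, or return None."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return {
        "prefix": match["prefix"],
        "congress": int(match["congress"]),
        "committee_code": match["committee_code"],
        "date": datetime.strptime(match["date"], "%Y%m%d").date(),
    }
```

Because meetings can be updated or postponed, values parsed this way would need to be refreshed on each check rather than stored once.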
I will ask whether there's XML for each calendar week, which should make discovering this information easier. We know that the data is parsable and usable because this is how, we think, Congress.gov gets the House meeting information.
Also, Josh Tauberer had built a tool that pulls down all the House and Senate proceedings and puts them into an agenda. We can ask him for his code, which should be on GitHub somewhere.
3. I don't understand this: "As a workaround, I thought about consuming the latest (past) committee meetings (XML), and parse the future committee meetings from the website. Today, the latest (past) committee meetings are not showing up, which is odd because it was listing previous meetings and not just today's. I'm thinking of scraping the monthly view, but this gets a bit hairy so I'll work on some of the other issues first"
Can you explain more about what you are having trouble finding?
I don't know how to pull @joshdata (Josh Tauberer) into this GitHub thread, but if he has a publicly available parsing tool for the House calendar, hopefully he can point us to it, because he had gotten this working.
Hi all. Yes of course there is already a scraper.
Documentation: https://github.com/unitedstates/congress/wiki/Committee-Meetings Script: https://github.com/unitedstates/congress/blob/master/tasks/committee_meetings.py
Thank you Daniel, Josh! I was able to use the existing parser to get the committee schedules. It looks like there is an XML API for each committee.
The existing parser gets a lot more detail (documents, witnesses, etc.), so I removed some of that logic in ours to reduce complexity. There's still some bad data in the XML API, so I further reduced the scope to only look at data from the past 10 days instead of 60 days.
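The reduced lookback window could look roughly like the sketch below. It assumes each event is a dict with a `date` key holding a `datetime.date`; the `within_lookback` helper and that dict shape are hypothetical, not the actual parser's data model.

```python
from datetime import date, timedelta
from typing import Dict, List


def within_lookback(events: List[Dict], today: date, days: int = 10) -> List[Dict]:
    """Keep events dated on or after `today - days`; older events are dropped."""
    cutoff = today - timedelta(days=days)
    return [event for event in events if event["date"] >= cutoff]


if __name__ == "__main__":
    sample = [
        {"title": "older meeting", "date": date(2021, 5, 1)},
        {"title": "recent meeting", "date": date(2021, 5, 15)},
        {"title": "upcoming meeting", "date": date(2021, 5, 25)},
    ]
    for event in within_lookback(sample, today=date(2021, 5, 21)):
        print(event["title"])
```

Passing `days=60` restores the wider window, so the cutoff stays a single tunable parameter.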
Implemented
There should be a filter for "Chamber" "Committee" and "Event Type"
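For illustration only, the three filters could be combined as in this sketch. The `Event` fields and the `filter_events` helper are hypothetical and are not taken from the FlatGov models; a filter left as `None` is simply not applied.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Event:
    title: str
    chamber: str
    committee: str
    event_type: str


def _matches(value: str, wanted: Optional[str]) -> bool:
    # A filter of None means "do not filter on this field".
    return wanted is None or value.casefold() == wanted.casefold()


def filter_events(
    events: Iterable[Event],
    chamber: Optional[str] = None,
    committee: Optional[str] = None,
    event_type: Optional[str] = None,
) -> List[Event]:
    """Return events matching every filter that was supplied (case-insensitive)."""
    return [
        event
        for event in events
        if _matches(event.chamber, chamber)
        and _matches(event.committee, committee)
        and _matches(event.event_type, event_type)
    ]
```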