unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
https://github.com/unitedstates/congress/wiki
Creative Commons Zero v1.0 Universal

Mapping a vote to a specific text version of a bill #313

Closed TTrapper closed 1 month ago

TTrapper commented 2 months ago

Once I've downloaded votes and bill texts, is there a way to map them? I'm currently doing this manually by extracting the bill_id from the vote data, then searching through the downloaded bills. It works, but I'm still not sure if/how I can map a vote to a particular text_version (e.g. ih, pcs, eh, etc.). Thanks!

JoshData commented 2 months ago

I don't think there is a direct way to map a vote to a text version. There's no text version in the XML vote data from the House or Senate. Often the text of what is being voted on hasn't been published yet. In some cases that's because the vote outcome is what causes the new text to be published. And there's no vote ID in the govinfo text MODS metadata. While it would be helpful, there are a lot of votes that don't have IDs (anything that isn't a roll call vote).

The govinfo bill text MODS metadata has a "Last Action Date Listed" field (which I guess is the originInfo/dateIssued element) which might correspond to the date of a congressional action, but I recall that I haven't found it reliable. I think one reason is that fast-moving bills can have multiple significant actions on the same date.
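To illustrate the same-date problem, here is a minimal sketch that flags dates carrying more than one action. The `(date, description)` pair shape and the sample data are made up for this example and are not the BILLSTATUS or MODS schema.

```python
from collections import defaultdict


def ambiguous_dates(actions):
    """Return the dates on which more than one action occurred.

    `actions` is a list of (date_string, description) pairs. This shape is
    an assumption made for the sketch, not a real govinfo structure.
    """
    by_date = defaultdict(list)
    for date, description in actions:
        by_date[date].append(description)
    # Keep only the dates that a date-based match could not disambiguate
    return {d: descs for d, descs in by_date.items() if len(descs) > 1}


# Made-up example data for a fast-moving bill
actions = [
    ("2023-01-09", "Introduced in House"),
    ("2023-01-11", "Passed House"),
    ("2023-01-11", "Received in Senate"),
]
print(ambiguous_dates(actions))
```

Any date returned here is one where matching a text version to an action by date alone would be a guess.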

For GovTrack, I have a map from bill text versions to bill status codes that are emitted by this project. The bill status codes are determined by parsing the govinfo BILLSTATUS XML data's action list. That usually works well. https://github.com/govtrack/govtrack.us-web/blob/main/bill/billtext.py
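The idea of such a map can be sketched roughly as below. This is an illustrative subset written for this thread, not a copy of the table in billtext.py; the status code strings are assumed to match what this project emits and should be checked against real output, and `versions_for_status` is a hypothetical helper.

```python
# Illustrative subset only: text version code -> bill status code.
# The status strings are assumptions; the real table in billtext.py may differ.
TEXT_VERSION_TO_STATUS = {
    "ih": "INTRODUCED",
    "is": "INTRODUCED",
    "rh": "REPORTED",
    "rs": "REPORTED",
    "eh": "PASS_OVER:HOUSE",
    "es": "PASS_OVER:SENATE",
    "enr": "PASSED:BILL",
}


def versions_for_status(status):
    """Return the text version codes associated with a status code (hypothetical helper)."""
    return sorted(code for code, s in TEXT_VERSION_TO_STATUS.items() if s == status)


print(versions_for_status("REPORTED"))
```

Several text versions can share one status code, so a lookup in this direction returns candidates rather than a single answer.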

TTrapper commented 2 months ago

Ok, this is very interesting, and a bit surprising, since the vote data is difficult to interpret if we can't say what exactly they were voting on. I understand this is an upstream issue and not a failing of this code-base.

What do you think of an approach like the following, where I parse the BILLSTATUS XML and try to find the text version whose date is on or before the date of the vote?

edit: an important caveat is that I am only interested in votes that have to do with the passage of a bill, so if this method has flaws in other scenarios that's fine.

from datetime import timezone

# Assumption: parse_date is dateutil's parser (the original snippet did not show its import)
from dateutil.parser import parse as parse_date


def get_voted_text_version(vote_datetime, bill_status_xml):
    """Return the code of the latest text version dated on or before the vote, or None.

    vote_datetime must be timezone-aware so it can be compared with the UTC version dates.
    """
    latest_textversion_type = None
    latest_textversion_date = None

    # Find the closest text version on or before the given vote date
    text_versions = bill_status_xml.find('.//textVersions')
    if text_versions is not None:
        for version_item in text_versions.findall('./item'):
            date_text = version_item.findtext('date')
            if not date_text:
                continue  # skip versions that have no date
            version_date = parse_date(date_text).astimezone(timezone.utc)

            # We're interested in the latest text version that's not after the vote
            if version_date <= vote_datetime:
                if latest_textversion_date is None or version_date > latest_textversion_date:
                    latest_textversion_date = version_date
                    latest_textversion_type = version_item.findtext('type')

    # Map the found text version type to the corresponding code
    type_code_map = {
        'engrossed in house': 'eh',
        'introduced in house': 'ih',
        'received in senate': 'rds',
        'referred in senate': 'rfs',
        'placed on calendar senate': 'pcs',
        'engrossed in senate': 'es',
        'engrossed amendment senate': 'eas',
        'reported in house': 'rh',
    }

    # Types missing from the map (or no matching version at all) yield None
    return type_code_map.get(latest_textversion_type.lower()) if latest_textversion_type else None

JoshData commented 1 month ago

I think you'd have to try it on a number of bills to see if it works well enough for you. There will definitely be edge cases where it fails. There always are!
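One way to try it on a number of bills is to tally how often the heuristic agrees with answers checked by hand. The following is a minimal sketch; `tally_results`, the case format, and the stand-in predictor are all hypothetical and only show the shape of such a check.

```python
from collections import Counter


def tally_results(cases, predict):
    """Count how often `predict` agrees with a hand-checked answer.

    `cases` is a list of (inputs, expected_code) pairs, where `inputs` is a
    tuple of arguments passed to `predict`. Both are hypothetical names.
    """
    counts = Counter()
    failures = []
    for inputs, expected in cases:
        try:
            got = predict(*inputs)
        except Exception as exc:  # edge cases often surface as exceptions
            counts["error"] += 1
            failures.append((inputs, expected, repr(exc)))
            continue
        if got == expected:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
            failures.append((inputs, expected, got))
    return counts, failures


# Stand-in predictor and made-up cases, only to exercise the tally
def stand_in(key):
    return {"a": "ih", "b": "eh"}[key]


cases = [(("a",), "ih"), (("b",), "rh"), (("c",), "ih")]
counts, failures = tally_results(cases, stand_in)
print(counts["match"], counts["mismatch"], counts["error"])
```

In practice `predict` would be the text version function from earlier in the thread, and the `failures` list is the place to look for the edge cases.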