unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
https://github.com/unitedstates/congress/wiki
Creative Commons Zero v1.0 Universal

Vote format has changed for House 2020? #258

Open demongolem opened 4 years ago

demongolem commented 4 years ago

Here is one that is not Python 3 :)

I am stepping through the code in vote.py and I see that the regex on the vote id is failing. Instead of the 4 parts that were expected, some vote id strings have a 5th part. The 4th part used to be the year, but in these strings the 5th part is the year and the 4th part is something I have not identified yet. Here is an example string. Perhaps the format has changed and newer vote ids need separate processing.

h102-116.5.2020

For the regex I have something like this in split_vote_id, which is actually in utils.py. Maybe I am missing the trailing $ in mine, but in any case an additional number group representing the 5 above needs to be added.

    return re.match("^(h|s)(\d+)\-(\d+)\.(\d+)\.(\d\d\d\d|[0-9A-Z])", vote_id).groups()
    #return re.match("^(h|s)(\d+)-(\d+).(\d\d\d\d|[0-9A-Z])$", vote_id).groups()

JoshData commented 4 years ago

I run these scripts every few hours every day to pull in new data and haven't been having a problem.

What command line are you using? Where is this vote id coming from?

demongolem commented 4 years ago

I use ./run votes. A typical value for vote_id at the line I commented out above is

h102-116.5.2020h102-116.5.2020h102-116.5.2020h102-116.5.2020h102-116.5.2020

except that they are distinct ids concatenated together (not the same vote id repeated), which need to be split. I don't have the output in front of me right now.

At https://github.com/unitedstates/congress/wiki/votes I see the vote id looks like

"vote_id": "h202-113.2013"

JoshData commented 4 years ago

"h202-113.2013" is what the vote IDs should look like. I'm not sure where the .5 is coming from.

Can you post a stack trace when you get a chance? Hopefully that'll point us in the right direction. :)

demongolem commented 4 years ago

When I was logging these vote_ids to disk, I omitted a newline :(. So really vote_id is only a single vote_id of the form I indicated.

When I do ./run votes, here is the beginning of the output I get

    Going to fetch 102 votes from congress #116.5 session 2020
    h102-116.5.2020
    h101-116.5.2020
    h100-116.5.2020
    h99-116.5.2020
    h98-116.5.2020
    h97-116.5.2020
    h96-116.5.2020
    h95-116.5.2020
    h94-116.5.2020
    h93-116.5.2020
    h92-116.5.2020
    h91-116.5.2020
    h90-116.5.2020
    h89-116.5.2020
    h88-116.5.2020
    h87-116.5.2020
    h86-116.5.2020
    h85-116.5.2020

And here is the stack trace I get with the regex as it originally was

    [h1-116.5.2020] Exception:

    Traceback (most recent call last):
      File "/home/gwerner/from_greg/congress/tasks/utils.py", line 182, in process_set
        results = fetch_func(id, options, *extra_args)
      File "/home/gwerner/from_greg/congress/tasks/vote_info.py", line 15, in fetch_vote
        vote_chamber, vote_number, vote_congress, vote_session_year = utils.split_vote_id(vote_id)
      File "/home/gwerner/from_greg/congress/tasks/utils.py", line 156, in split_vote_id
        return re.match("^(h|s)(\d+)-(\d+).(\d\d\d\d|[0-9A-Z])$", vote_id).groups()
    AttributeError: 'NoneType' object has no attribute 'groups'

If I go to an online python regex validator, obviously there will be no matches for the vote_ids which I have supplied.
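For anyone reproducing this, here is a minimal sketch of the same check. It is not the project's code: the pattern is taken from the traceback, but the raw string, the escaped dot, and the explicit ValueError are assumptions added for illustration. It shows that a well-formed id splits into four parts while an id containing "116.5" does not match at all.

```python
import re

# Pattern from the traceback, rewritten as a raw string with the dot escaped
# (both changes are assumptions for this sketch, not the project's code).
VOTE_ID_RE = re.compile(r"^(h|s)(\d+)-(\d+)\.(\d\d\d\d|[0-9A-Z])$")


def split_vote_id(vote_id):
    """Split a vote id into (chamber, number, congress, session)."""
    match = VOTE_ID_RE.match(vote_id)
    if match is None:
        # Fail with a readable message instead of the AttributeError on None.
        raise ValueError("unrecognized vote id: %r" % vote_id)
    return match.groups()


print(split_vote_id("h202-113.2013"))  # ('h', '202', '113', '2013')

try:
    split_vote_id("h102-116.5.2020")
except ValueError as exc:
    print(exc)  # the fractional congress number is rejected
```

Raising a ValueError that names the offending id makes a malformed id such as "h102-116.5.2020" visible right away, rather than surfacing as a 'NoneType' error.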

JoshData commented 4 years ago

I'm going to go out on a limb here and say that you are somehow running this with Python 3 or non-standard Python 2 command-line arguments? "116.5" looks like 116-and-a-half which suggests some Python 3 division is happening.

demongolem commented 4 years ago

Yes, I see where the division is happening in utils.py

    def congress_from_legislative_year(year):
        return ((year + 1) / 2) - 894

Of course, in Python 3 that would need to be // instead of /.
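A minimal sketch of that change, assuming nothing else in the function needs to differ. The function name and formula come from utils.py as quoted above; only the operator is changed. Floor division returns an int on both Python 2 and Python 3, so the congress number stays whole.

```python
def congress_from_legislative_year(year):
    # Floor division (//) yields an int under both Python 2 and Python 3.
    return ((year + 1) // 2) - 894


print(congress_from_legislative_year(2020))  # 116
print(congress_from_legislative_year(2013))  # 113

# For comparison, true division on Python 3 gives the fractional value
# that ended up in the vote ids:
print(((2020 + 1) / 2) - 894)  # 116.5 on Python 3
```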