unitedstates / congress-legislators

Members of the United States Congress, 1789-Present, in YAML/JSON/CSV, as well as committees, presidents, and vice presidents.
Creative Commons Zero v1.0 Universal

Add Congress Number to legislators YAML Files "Terms" #185

Open dmil opened 10 years ago

dmil commented 10 years ago

Currently, a term for a legislator in the YAML file looks like this:

- type: rep
  start: '1993-01-05'
  end: '1994-12-01'
  state: OH
  district: 13
  party: Democrat

I want to add a field called "congress" (in this case the value would be 103).

My question is: how can I do this in a way that will allow me to contribute it back to the repo? I figured a YAML file that maps date ranges to congress numbers, plus a script to update every record in the existing YAML files, would be a start, but if you can think of a better way, I can do that.
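A minimal sketch of the date-to-congress step, assuming the Constitutional boundaries (March 4 of odd years originally, January 3 of odd years from 1935 on) rather than actual swearing-in or adjournment dates. The function name `congress_for_date` is hypothetical and not something that exists in the repo.

```python
from datetime import date


def congress_for_date(d: date) -> int:
    """Return the Congress number covering date d.

    Assumption: uses the Constitutionally defined boundaries, not the
    actual convening/adjournment dates of each session.
    """
    if d < date(1789, 3, 4):
        raise ValueError("the 1st Congress began on 1789-03-04")
    # Congresses turned over on March 4 until the 20th Amendment moved
    # the boundary to January 3, starting in 1935.
    boundary = date(d.year, 1, 3) if d.year >= 1935 else date(d.year, 3, 4)
    # Dates before this year's boundary still count toward the prior year.
    year = d.year if d >= boundary else d.year - 1
    return (year - 1789) // 2 + 1


term = {"type": "rep", "start": "1993-01-05", "end": "1994-12-01"}
start = congress_for_date(date.fromisoformat(term["start"]))
end = congress_for_date(date.fromisoformat(term["end"]))
print(start, end)  # 103 103
```

If the start and end of a term map to different numbers, the term crosses a Congress boundary and would need either several values or a manual look.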

dwillis commented 10 years ago

Hey Dhrumil,

That would be a great start, I think, and something we've long been interested in. There are some tricky issues with end dates for some of the early Congresses, but I think this would be a good solution for most of the legislators. @JoshData?

konklone commented 10 years ago

I love the idea of creating a mapping that allows anyone to easily find out which congress a given term was part of. A YAML file, say congresses.yaml, whose sole job was to map congress numbers to start and end dates, would be hugely helpful.

I'm less excited about also updating our terms YAML historically and going forward to keep that in sync. Armed with congresses.yaml, anyone could do that lookup in the way that made sense for them, and we wouldn't have to change our terms array.
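A lookup against such a mapping might look like the sketch below. The records are inlined as Python data rather than loaded from a file so the example stands alone. The field names (`congress`, `start`, `end`) and the function `congresses_for_term` are assumptions, and the dates are the Constitutional January 3 boundaries, not sourced session dates.

```python
from datetime import date

# Hypothetical contents of congresses.yaml, inlined as Python data.
# Each Congress is treated as the half-open interval [start, end).
CONGRESSES = [
    {"congress": 103, "start": date(1993, 1, 3), "end": date(1995, 1, 3)},
    {"congress": 104, "start": date(1995, 1, 3), "end": date(1997, 1, 3)},
    {"congress": 105, "start": date(1997, 1, 3), "end": date(1999, 1, 3)},
]


def congresses_for_term(start: date, end: date) -> list:
    """Return every Congress number that overlaps the term [start, end)."""
    return [
        c["congress"]
        for c in CONGRESSES
        if start < c["end"] and end > c["start"]
    ]


# The House term quoted at the top of this thread sits in one Congress.
print(congresses_for_term(date(1993, 1, 5), date(1994, 12, 1)))  # [103]
# An illustrative six-year Senate term overlaps three.
print(congresses_for_term(date(1993, 1, 5), date(1999, 1, 3)))  # [103, 104, 105]
```

Treating both the Congress and the term as half-open intervals keeps a term that ends on January 3 from also matching the Congress that begins that same day.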

JoshData commented 10 years ago

I tried recently to do this locally, to see if it's possible. My date-to-congress function is here:

https://github.com/govtrack/govtrack.us-web/blob/master/us.py#L93
Example calls: https://github.com/govtrack/govtrack.us-web/blob/master/person/models.py#L384

I ran into data errors though where old terms were crossing Congresses (my bad). Some of those errors may have since been coincidentally fixed in e0bffb35b4a523c55e21248a394543074b0a82fd.

I maintain http://govtrack.us/data/us/sessions.tsv which has the start (e.g. swearing-in) and end (adjournment) dates of each session of Congress. That would be a source for congresses.yaml (would be glad to see the data migrated here). These dates differ from the Constitutionally-defined start/end dates of Congresses (e.g. now Jan 3).

It would be very helpful for me to get this squared away. I've usually sided on keeping the YAML free of machine-inferable info. If the congress can be inferred from another mapping file, I like that, as @konklone suggested. Though I'm not opposed to also adding a field because terms could really benefit from having some sort of explicit primary key.

The state that we're in though (at least last I tried) is that it's actually not possible to assign a congress to every term. So we'd need to fix the data first.

Also see issues #7 and #157 for reasons our data may be incorrect in ways that make it difficult to resolve this. I am sure there was another issue where I mentioned I screwed up senate appointment dates but I can't find it.

dwillis commented 10 years ago

I think keeping things in sync would only really be an issue in new congresses once we were able to nail down the older stuff (which, as @JoshData points out, isn't always easy). I think the benefits of having a congress attribute outweigh the burdens of maintenance.

joec commented 10 years ago

You can get a list of the legislators for each Congress through the Bioguide form itself which accepts Congress or year.

Are you using http://www.gpo.gov/fdsys/pkg/GPO-CDOC-108hdoc222/pdf/GPO-CDOC-108hdoc222.pdf or the Bioguide data or something else? For Bioguide, the goal was to capture the Congress associated with each sworn-in legislator. We manually checked the legislators for each Congress using tables in the front of the book. I also believe the sworn-in dates from the front material are included in the individual biographies, but that would require scraping the text.

I thought the Bioguide also included elected legislators as well as sworn-in legislators, but apparently not. For example, Jack Swigert isn't included as an entry, but he has a footnote in the Bioguide document. http://physics.about.com/od/classroomphysics/ig/Washington-DC-science-sites/Jack-Swigert-statue.htm

JoshData commented 10 years ago

Yes the historical data (pre-2003-ish) originated from bioguide's search results page (where the Congress numbers are listed). Sadly I turned the Congress numbers into dates via sessions.tsv and then discarded the Congress numbers. (This was around 2008.)

dwillis commented 10 years ago

Swigert used to have an entry in bioguide, actually. I found him and several other mistakes, which the Library has fixed. Similarly, you'll find George Washington and other members of the continental congress there, too.

joec commented 10 years ago

Bioguide was maintained by the Clerk of the House and the Secretary of the Senate. Is that no longer the case? http://en.wikipedia.org/wiki/Biographical_Directory_of_the_United_States_Congress
Maybe the confusion is because of the domain - congress.gov.

I might be catching up here, so I'm sorry if this has already been discussed/determined. It seems to me, given all of the data you're already storing in the YAML file and the need for more useful info, that the preference might be for a single "congresses-served" value that would contain a Congress range allowing for non-contiguous terms (e.g., 108, 110-113). The Bioguide form interacts with data in a SQL table that was created by extracting the relevant data from the Bioguide SGML files themselves, and I'd say we wanted to create a record for each Congress for each legislator, but that was 16 years ago. I definitely remember wanting to see who served with Lincoln in the House.
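For illustration, a "congresses-served" value in that style could be produced and read back with something like the following. Both function names are hypothetical, and the string format is only the one suggested by the "108, 110-113" example.

```python
def format_congress_ranges(congresses):
    """Collapse Congress numbers into a string like '108, 110-113'."""
    numbers = sorted(set(congresses))
    if not numbers:
        return ""
    runs = []
    run_start = prev = numbers[0]
    for n in numbers[1:]:
        if n == prev + 1:
            # Still inside a contiguous run of Congresses.
            prev = n
            continue
        runs.append((run_start, prev))
        run_start = prev = n
    runs.append((run_start, prev))
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


def parse_congress_ranges(text):
    """Expand a string like '108, 110-113' back into a list of numbers."""
    result = []
    for piece in text.split(","):
        piece = piece.strip()
        if not piece:
            continue
        first, _, last = piece.partition("-")
        result.extend(range(int(first), int(last or first) + 1))
    return result


print(format_congress_ranges([108, 110, 111, 112, 113]))  # 108, 110-113
print(parse_congress_ranges("108, 110-113"))  # [108, 110, 111, 112, 113]
```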

Unipartisan commented 9 years ago

Josh, you say, "it's actually not possible to assign a congress to every term." I assume that is because a term is longer than a congress for senators. What I would find very useful would be to group currently serving members of congress by the congress number rather than to label a historical file with a congress field.

As it currently stands, we have legislators-current.yaml and legislators-historical.yaml. If I wanted a copy of the legislators-current.yaml pertaining to 'current' members during the 113th congress, I can only get that by browsing back through the git repository to an earlier state because the current one is for the current congress, 114.

What if, at the end of a given congress, the legislators-current.yaml file were finalized and saved as legislators-113.yaml, legislators-114.yaml, etc., moving forward?

If a term were broken mid-congress my recommendation would be to simply add the successor into the list so that anyone who served any time during that congress would be listed. Doing this would allow us to classify legislators by congresses without splitting up date ranges for the terms.

A major problem I have come across in my work in regards to the legislators-historical.yaml file is a lack of a certain id field 'lis'. For example "lis: S136" which only appears for senators. This 'lis' field is in the legislators-current.yaml, but is not in the legislators-historical.yaml.

Here is why that 'lis' field matters so much for someone who may be trying to do what I am.

If you look at this .json file for sequential vote number 281 in the senate in 2013

https://www.govtrack.us/data/congress/113/votes/2013/s281/data.json

You see senators listed under Nay and Yea in the following form:

    "display_name": "Alexander (R-TN)", 
    "first_name": "Lamar", 
    "id": "S289",   (<---this is the equivalent of 'lis' in legislators-current.yaml)
    "last_name": "Alexander", 
    "party": "R", 
    "state": "TN"

The 'lis' field from legislators-current.yaml, which does not appear in legislators-historical.yaml is the only id type that links the voter to their respective listing in the legislator data. ('lis' is listed as "id" above)

First, I build a database table listing legislators and their data using legislators-current.yaml (this can be viewed as being specific to a certain congress). Let's call that table 'legislators113'. I use that table of the congress's legislators to make an empty table for votes to be added. Column 1 is the bill id, and the remaining columns are the legislator ids that can be paired with the vote files. For senators, that is the 'lis' id.

Then I machine-infer that 2013 or 2014 are part of congress 113 depending which year I am building vote tables for. A script checks each vote starting with h1 (/113/votes/2013/h1). If the category is not 'passage', or 'passage-suspension', that file is skipped. It runs through and saves how each legislator voted on each bill in those categories in one big table with about 33k cells for that year. This can then be repeated for 2014, 2015 etc.

If the 'lis' field goes missing from the legislators.yaml then there is nothing simple to link the data in the 2 files together. I am afraid to use names because they can be duplicated and possibly cause errors.
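The linking step described above can be sketched roughly as follows. The records are trimmed, hypothetical stand-ins for the YAML and vote JSON (the `votes` key, the vote positions, and the ID `S000` are made up for illustration), and `index_by_lis` and `link_votes` are not functions from this repo.

```python
# Hypothetical, trimmed-down records shaped like the legislators YAML.
legislators = [
    {"id": {"lis": "S289"}, "name": {"first": "Lamar", "last": "Alexander"}},
    {"id": {}, "name": {"first": "Example", "last": "NoLisId"}},
]

# Assumed shape of a vote file: vote option -> list of voter records.
# The positions below are invented, not taken from a real roll call.
vote = {
    "votes": {
        "Yea": [{"id": "S289", "display_name": "Alexander (R-TN)"}],
        "Nay": [{"id": "S000", "display_name": "Unknown (X-XX)"}],
    }
}


def index_by_lis(people):
    """Map each lis ID to its legislator record, skipping people without one."""
    index = {}
    for person in people:
        lis = person.get("id", {}).get("lis")
        if lis is None:
            continue
        if lis in index:
            raise ValueError(f"duplicate lis ID: {lis}")
        index[lis] = person
    return index


def link_votes(vote, by_lis):
    """Return (matched, unmatched); matched maps lis ID -> vote option."""
    matched, unmatched = {}, []
    for option, voters in vote["votes"].items():
        for voter in voters:
            if voter["id"] in by_lis:
                matched[voter["id"]] = option
            else:
                unmatched.append(voter["id"])
    return matched, unmatched


matched, unmatched = link_votes(vote, index_by_lis(legislators))
print(matched)    # {'S289': 'Yea'}
print(unmatched)  # ['S000']
```

Collecting unmatched IDs instead of dropping them makes it easy to see whether a missing 'lis' value is a data gap or a lookup mistake.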

I am not sure what the 'lis' field actually is. Maybe it is some sort of recyclable id type, and that is why it cannot be added to the historical file without duplication. However, even if that is the case, splitting the files by congress should not cause an issue in that regard and, for other purposes, would allow us to easily view members by congress blocks.

I will add my scripts once I get them a little more portable. Also they are written in PHP, not python. I hope that is okay.

JoshData commented 9 years ago

> Josh, you say, "it's actually not possible to assign a congress to every term." I assume that is because a term is longer than a congress for senators.

That's one reason. But I think I was referring to the fact that the historical data is not very clean and that there are some terms with incoherent dates that will need to be manually fixed. That is, automated assignment of a Congress number will fail for those terms. I think we're all in agreement about adding a Congress number field to terms (or at least we were a year ago). Someone just has to do the work now to make it happen.

> This 'lis' field is in the legislators-current.yaml, but is not in the legislators-historical.yaml.

No, the lis field does occur in both files. We have the complete linkage of lis IDs, I believe.

(If lis doesn't appear for some person in this repo, it's because the Senate has not assigned that person an lis ID. If the lis ID occurs in a Senate vote, it is listed as someone's lis ID here. This is how I load the vote data into GovTrack.)

> I can only get that by browsing back through the git repository to an earlier state because the current one is for the current congress, 114.

You should never roll back this repository. No information is ever deleted from the legislators files, only moved. You should always extract what you need from the combination of the two legislators files, filtering by date. (One day we'll be able to filter by Congress if anyone does the work to add the field.)
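As a rough sketch of that date filter: the records below are hypothetical stand-ins for what would be loaded from legislators-current.yaml and legislators-historical.yaml (real records carry more fields and a structured name), `served_between` is an invented helper, and the 113th Congress window uses the Constitutional January 3 dates.

```python
from datetime import date

# Hypothetical stand-ins for records loaded from the two YAML files.
current = [
    {"name": "Member A",
     "terms": [{"type": "sen", "start": "2011-01-05", "end": "2017-01-03"}]},
]
historical = [
    {"name": "Member B",
     "terms": [{"type": "rep", "start": "2013-01-03", "end": "2015-01-03"}]},
    {"name": "Member C",
     "terms": [{"type": "rep", "start": "1993-01-05", "end": "1994-12-01"}]},
]


def served_between(people, start, end):
    """Yield each person with at least one term overlapping [start, end)."""
    for person in people:
        for term in person["terms"]:
            term_start = date.fromisoformat(term["start"])
            term_end = date.fromisoformat(term["end"])
            if term_start < end and term_end > start:
                yield person
                break  # one overlapping term is enough


# Anyone who served at any point during the 113th Congress.
members_113 = list(
    served_between(current + historical, date(2013, 1, 3), date(2015, 1, 3))
)
print([p["name"] for p in members_113])  # ['Member A', 'Member B']
```

Because both files are searched together, a member who has since moved from the current file to the historical one is still found.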

Unipartisan commented 9 years ago

My mistake. I just didn't see them the way I had searched the file. I assumed that all senators had one. Searched for 'sen' and the first few I saw did not have lis ids. Adding congresses would nevertheless solve the problem I am having too.