unitedstates / congress-legislators

Members of the United States Congress, 1789-Present, in YAML/JSON/CSV, as well as committees, presidents, and vice presidents.
Creative Commons Zero v1.0 Universal
New House member data XML file #276

Open JoshData opened 9 years ago

JoshData commented 9 years ago

The House Clerk is publishing a new member data file in XML linked from http://clerk.house.gov/:

http://clerk.house.gov/xml/lists/MemberData.xml

We should replace any scraper with this data where possible.

H/t and big thanks to @GullicksonK!

schmod commented 9 years ago

It's a Christmas Miracle!

joelcollinsdc commented 9 years ago

I started working on this here:

https://github.com/joelcollinsdc/congress-legislators/commit/16ed6dbc3b69a78420f47424e00e3f2f0050afa2

Some questions:

JoshData commented 9 years ago

I totally forgot about this. Neat.

Should all data in the Clerk's XML feed take precedence over data in congress-legislators? Lots of middle names and such are different. Also lots of first names in the clerk XML feed appear to just be initials? I am presuming this was their official preference...

I'm not sure. It depends on how everyone uses the data to render names. G. K. Butterworth^H^H^HButterfield, for instance, is always trouble. I'll need to look more carefully.

Should I remove the , before Jr. in the official name to match what is currently in this repo?

You mean in official_full? I added official_full only for the purposes of sanity checking that the members were aligning properly against the old unstructured data source. Since this new data has a bioguide ID in it, we should just drop the official_full field from our YAML (at least as it exists now --- if we want to put in a full name field, we should rethink what the best content should be).

joelcollinsdc commented 9 years ago

These are the name related fields that appear to be available

<lastname>Young</lastname>
<firstname>Don</firstname>
<middlename/>
<sort-name>YOUNG,DON</sort-name>
<suffix/>
<courtesy>Mr.</courtesy>
<official-name>Don Young</official-name>
<formal-name>Mr. Young of Alaska</formal-name>

There is one member (IL18) that has no official-name for some reason (workflow?). I imagine there is a lot of thought that goes into this stuff. It would make sense to me to make these fields available via this repo for those that choose to use them.
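A minimal sketch of reading these name fields with Python's standard library and flagging records with no official-name. The field names come from the snippet above; the wrapping `<members>` and `<member>` elements and the second, empty record are assumptions made up for illustration and may not match the real layout of MemberData.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper elements; only the name fields come from the thread.
SAMPLE = """
<members>
  <member>
    <lastname>Young</lastname>
    <firstname>Don</firstname>
    <middlename/>
    <sort-name>YOUNG,DON</sort-name>
    <suffix/>
    <courtesy>Mr.</courtesy>
    <official-name>Don Young</official-name>
    <formal-name>Mr. Young of Alaska</formal-name>
  </member>
  <member>
    <lastname/>
    <firstname/>
    <middlename/>
    <sort-name/>
    <suffix/>
    <courtesy/>
    <official-name/>
    <formal-name/>
  </member>
</members>
"""

NAME_FIELDS = [
    "lastname", "firstname", "middlename", "sort-name",
    "suffix", "courtesy", "official-name", "formal-name",
]


def read_names(xml_text):
    """Return one dict per <member>, with empty or missing fields as None."""
    root = ET.fromstring(xml_text)
    records = []
    for member in root.iter("member"):
        record = {}
        for field in NAME_FIELDS:
            # findtext gives None for a missing element, "" for an empty one.
            text = member.findtext(field)
            record[field] = text.strip() if text and text.strip() else None
        records.append(record)
    return records


names = read_names(SAMPLE)
# Indexes of records without an official name (like the vacant seat above).
missing_official = [i for i, n in enumerate(names) if n["official-name"] is None]
```

Treating empty elements as None keeps a vacancy from being mistaken for a member whose name is an empty string.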

GullicksonK commented 9 years ago

Joel,

IL-18 was the most recent vacancy. I'll check on the official list on the Clerk's page tomorrow and see why it's empty. Clerk staff usually don't update the XML until all variations of the name are known.

Also, keep in mind a Member's official name may be different than their legal name or the name they use on the ballot. I can provide some samples if folks want some.

Kirsten

GullicksonK commented 9 years ago

The missing data for IL 18 has been resolved. An updated file is available on clerk.house.gov.

joelcollinsdc commented 9 years ago

Thanks @GullicksonK!

Would you be able to comment on how the first, middle, and lastname fields are decided on? For example, it appears there are many fewer middle names available in the XML feed than exist in this repo... does that reflect the Member's desire to not use a middle name, or was the data just not available? And some first/middle names appear different than one would expect: "G" "K" "Butterfield", for instance, instead of George Kenneth Butterfield. I think this repo retains the concept of a nickname, which allows it to store both the real first and last names as well as the Member's preferred name.

joelcollinsdc commented 8 years ago

I added a PR with this work; there are 2 new name fields for House members (list and formal). The Clerk XML also has a sort name field, which is useful for properly sorting names with special characters. Should that be added as well?

There was some discussion before about whether the first, middle, and last names should be replaced with what the XML has. I'm open to leaving these alone if that's what others think is best for now.

JoshData commented 8 years ago

@joelcollinsdc and I had a good conversation about this yesterday. I think it's worth considering, in the long run, dropping first/middle/last name fields and replacing them with e.g.:

full_name: G.K. "Jeekay" Butterfield    (made-up nickname for the purposes of example)
for_sorted_lists: Butterfield, G.K.
surname: Butterfield

... given the difficulty of dividing names into first/middle/last parts and then reconstructing good display strings from those fields (whether the middle name should be included in the display string varies from person to person). Joel pointed out that Rep. Aumua Amata Coleman Radewagen is another particularly difficult one. Here's the House XML data:

<official-name>Aumua Amata Coleman Radewagen</official-name>
<namelist>Radewagen, Aumua Amata</namelist>
<firstname>Aumua Amata</firstname>
<middlename>Coleman</middlename>
<lastname>Radewagen</lastname>
<suffix/>
<courtesy>Mrs.</courtesy>
<formal-name>Mrs. Radewagen</formal-name>

We currently have:

first: Aumua
last: Amata

which is substantially different!

konklone commented 8 years ago

Separating out the middle name doesn't seem very important, but separating out the "nickname" does. This is because it's not ideal to use the nickname in full quoted form like G.K. "Jeekay" Butterfield. You generally just want to say Jeekay Butterfield.

That's because the "nickname" is usually just what they go by. But, looking at the official House data, it looks like the Clerk generally uses the nickname they actually go by as their official first name (like Dave Brat instead of David Brat). So, if we were comfortable just adopting the Clerk's first name as the official first name, I think that would get us what we want.

For the case where the House doesn't have their preferred name (e.g. the Clerk's Christopher Smith goes by Chris Smith -- our unitedstates data has his nickname correctly as Chris) we could override it, or we could ask the House if they would reconsider which name they consider official.
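A rough sketch of the override idea: take the Clerk's first name by default and substitute a preferred name only where one has been recorded. The overrides table, the function name, and the bioguide ID used as a key are all hypothetical placeholders, not data from the Clerk or this repo.

```python
# Hypothetical override table keyed by bioguide ID.
# "X000000" is a placeholder, not a real member's ID.
PREFERRED_FIRST_NAME_OVERRIDES = {
    "X000000": "Chris",
}


def preferred_first_name(bioguide_id, clerk_firstname,
                         overrides=PREFERRED_FIRST_NAME_OVERRIDES):
    """Use the Clerk's first name unless an override is recorded."""
    return overrides.get(bioguide_id, clerk_firstname)
```

Keeping the overrides in one small table would make it easy to see how often the repo diverges from the Clerk and to drop an entry if the Clerk's data changes.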

JoshData commented 8 years ago

Makes sense, I think. If we use the firstname for what the Clerk says the rep goes by, and drop their legal first name and any nicknames from the data, that will simplify things.

If the display string is always first + last or last + ', ' + first (no middle/nickname in either case; both would add ', ' + suffix if present), then we're fine just using first/last fields. I suppose every name can be decomposed into two parts like that and then the full and last-first forms are easily constructible and we wouldn't need extra fields.

I'd just want to actually check that that's the case (I'll want to compare to names on GovTrack before we merge). Also have to figure out if any change in that logic affects how historical reps are displayed.
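The two display forms described above could be built along these lines. This is only a sketch of the rule as stated in this comment (no middle name or nickname, suffix appended after a comma); the function names are made up for illustration.

```python
def display_name(first, last, suffix=None):
    """first + ' ' + last, plus ', ' + suffix if present."""
    name = f"{first} {last}"
    return f"{name}, {suffix}" if suffix else name


def sorted_list_name(first, last, suffix=None):
    """last + ', ' + first, plus ', ' + suffix if present."""
    name = f"{last}, {first}"
    return f"{name}, {suffix}" if suffix else name
```

For example, display_name("Don", "Young") gives "Don Young" and sorted_list_name("G.K.", "Butterfield") gives "Butterfield, G.K.", matching the for_sorted_lists example earlier in the thread.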

GullicksonK commented 8 years ago

I would encourage use of the middle name as well. It would be safest: First + Middle + Last

There are many name types: legal name, nickname, name on ballot, name on public disclosures, and then, the official name with the U.S. House.

Using all three First + Middle + Last makes sense, particularly if you look at the following records:

· S000480 Slaughter
· R000600 Radewagen
· R000576 Ruppersberger
· S000185 Scott
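As a quick illustration of the First + Middle + Last suggestion, joining the non-empty parts of the Radewagen record quoted earlier in the thread reproduces its official-name. The helper name is hypothetical, and this is a sketch rather than a claim about how the Clerk builds the field.

```python
def full_name(first, middle, last):
    """Join the non-empty name parts with single spaces."""
    return " ".join(part for part in (first, middle, last) if part)


# Values taken from the House XML quoted earlier in this thread.
radewagen = full_name("Aumua Amata", "Coleman", "Radewagen")
young = full_name("Don", None, "Young")
```

Skipping empty parts means a member with no middlename, such as Don Young, still gets a clean two-part name.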

JoshData commented 8 years ago

Fyi here, I left some comments on the PR in #345.

joelcollinsdc commented 7 years ago

The other data available in the clerk xml file is committee memberships. I've started working on this in https://github.com/unitedstates/congress-legislators/tree/276-clerk-xml-comm-mbrs

I noticed running the current committee_membership.py script results in a lot of changes, lots of things being output in seemingly random order. Is preserving order important?

Also, is keeping the thomas ID around important? Why have 2 ids?

dwillis commented 7 years ago

I don't think order of committee memberships is important (to me, at least). I vote to ditch Thomas ID.

JoshData commented 7 years ago

+1 to ditching thomas IDs. There aren't thomas IDs for new members at this point anyway, afaik.

Having stable order in the output from run to run makes diffs nicer, but the particular order isn't important to me either.

konklone commented 7 years ago

+1 to ditching THOMAS IDs for the reasons mentioned.

I think I remember the order of committee memberships being important, not just the chair/ranking member, but for seniority down the whole committee. It may only be practically important internally to the committee, but if there are any effects at all then it's politically relevant and a (subtle) data point we should preserve.

That said, I haven't analyzed the scraper recently and don't know why it's not outputting in a consistent order. If it's a super-hard task to do, it could be worth bending.
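One way to reconcile the two concerns above might be to store seniority as an explicit field and then sort on it before writing, so output is stable from run to run without losing the ordering. The sketch below assumes each membership record carries a `rank` and a `bioguide` key; those field names, the function name, and the placeholder IDs are assumptions for illustration, not a description of what committee_membership.py does.

```python
def stable_member_order(members):
    """Sort by explicit rank, with bioguide ID as a deterministic tie-breaker."""
    return sorted(members, key=lambda m: (m["rank"], m["bioguide"]))


# Placeholder records in arbitrary scrape order (IDs are made up).
scraped = [
    {"bioguide": "B000002", "rank": 2},
    {"bioguide": "A000001", "rank": 1},
    {"bioguide": "C000003", "rank": 3},
]

ordered = stable_member_order(scraped)
```

Because the sort key depends only on the record contents, the same set of members produces the same output no matter what order the scraper returned them in, which keeps diffs small.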