unitedstates / congress-legislators

Members of the United States Congress, 1789-Present, in YAML/JSON/CSV, as well as committees, presidents, and vice presidents.
Creative Commons Zero v1.0 Universal

LIS IDs for committees #62

Closed schmod closed 11 years ago

schmod commented 11 years ago

Given that the information is now readily available, could we discuss the prospect of referring to committees via their LIS IDs instead of the current hodgepodge of Thomas IDs and partial LIS IDs?

At least on the Senate side, LIS IDs seem to be the de facto standard for referring to committees, and this repository's way of giving committees (and especially subcommittees) unique IDs seems unnecessarily convoluted.

Right now, standard committees look like

- type: senate
  name: Senate Committee on Appropriations
  url: http://appropriations.senate.gov/
  thomas_id: SSAP
  senate_committee_id: SSAP
  subcommittees:
  - name: Commerce, Justice, Science, and Related Agencies
    thomas_id: '16'
  - name: Energy and Water Development
    thomas_id: '22'
-----
- type: house
  name: House Committee on Agriculture
  url: http://agriculture.house.gov/
  thomas_id: HSAG
  house_committee_id: AG
  subcommittees:
  - name: Conservation, Energy, and Forestry
    thomas_id: '15'
    address: 1301 LHOB; Washington, DC 20515
    phone: (202) 225-2171

In our list (as far as I can tell) the thomas_id always matches the senate_committee_id, which (I think) is supposed to be the LIS ID's prefix. (However, LIS always refers to full committees as XXXX00.) On the House side, house_committee_id always seems to match the last two letters of the thomas_id, which, again, is the prefix of the LIS ID.

Similarly, for subcommittees, the thomas_id always seems to be equal to the LIS ID's numeric suffix. (In the above examples, Energy & Water's LIS ID is SSAP22, while Conservation, Energy, and Forestry is HSAG15).
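To make the pattern concrete, here is a minimal Python sketch of how a 6-character LIS-style ID could be assembled from the fields shown in the YAML above. The function name `lis_id` is hypothetical and not part of this repository, and the sketch assumes the prefix-plus-suffix pattern really does hold for every committee, which is exactly what is still in question.

```python
def lis_id(committee_thomas_id, subcommittee_thomas_id=None):
    """Build a 6-character LIS-style ID (hypothetical helper, not in the repo).

    Assumes the full committee's thomas_id is the 4-character LIS prefix and
    the subcommittee's thomas_id is the 2-digit LIS suffix.
    """
    if subcommittee_thomas_id is None:
        # Full committees are referred to with the '00' suffix.
        suffix = "00"
    else:
        # Subcommittee thomas_ids are quoted strings like '16' in the YAML.
        suffix = str(subcommittee_thomas_id).zfill(2)
    return committee_thomas_id.upper() + suffix


print(lis_id("SSAP"))        # SSAP00
print(lis_id("SSAP", "22"))  # SSAP22
print(lis_id("HSAG", "15"))  # HSAG15
```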

I'm not sure if there are any inconsistencies in here (thus justifying the duplicative specification of IDs), but it sure would seem easiest to just specify each committee and subcommittee by its LIS ID.

Thoughts?

konklone commented 11 years ago

I have a similar dislike for how hodgepodge it feels; it's confusing. I believe the reason for having multiple IDs is that the different systems that mention committees (House.gov, Senate.gov, THOMAS.gov) don't all use the LIS ID. I think @tauberer knows more about this than I do.

I imagine there's a way to make this simpler, and maybe it's just by renaming thomas_id to lis_id or committee_id?

schmod commented 11 years ago

Diving into this further, it does appear as though the House often does refer to committees without the first two characters of the LIS ID (so, WM00 instead of HSWM00).

The LIS code system has been in use since the 93rd Congress (1973), and everybody now seems to use those codes in one format or another: the Senate usually uses the full LIS ID, but occasionally omits the numeric subcommittee identifier at the end; Thomas uses the numeric identifier by itself for subcommittees; the House sometimes uses just the two-character abbreviation, occasionally adding the numbers.

Given that the LIS ID seems to be the most "harmonized" of all of these (i.e. you can very easily derive any of the others from it), it seems to be the most logical one for us to provide.
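As a rough illustration of "derive any of the others from it", the following Python sketch slices a full LIS ID into the shorter forms described above. The function name and dictionary keys are made up for this example, and it assumes every ID is exactly 6 characters.

```python
def derived_forms(lis_id):
    """Slice a 6-character LIS ID into the shorter forms other sources use.

    Hypothetical helper; the key names are illustrative only.
    """
    lis_id = lis_id.upper()
    if len(lis_id) != 6:
        raise ValueError(f"expected a 6-character LIS ID, got {lis_id!r}")
    prefix, number = lis_id[:4], lis_id[4:]
    return {
        "prefix": prefix,             # Senate form without the numeric suffix
        "thomas_number": number,      # numeric identifier Thomas uses for subcommittees
        "house_short": prefix[2:],    # House two-character abbreviation
        "house_long": prefix[2:] + number,  # House form without the first two characters
    }


print(derived_forms("HSWM00")["house_long"])  # WM00
```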

Unless there are inconsistencies...

JoshData commented 11 years ago

The inconsistencies question is the real issue. There used to be inconsistencies, especially for the joint committees.

So what are the codes that the House uses now?

schmod commented 11 years ago

Picking the IDs apart, the schema as far as I can deduce is:

1st Character:

2nd Character:

This schema does not appear to be terribly strict. The Joint Select Committee on Deficit Reduction uses an 'S' instead of an 'L'.

3rd-4th Characters:

2-Character alphanumeric abbreviation. I'm told that these are arbitrarily assigned. Usually alphabetic -- the Senate Year 2000 Technology Problem (sp2k00) is one exception.

5th-6th Characters:

2-character numeric identifier. I'm also told that these are arbitrarily assigned, and to not infer anything from them.

The full committee is always referred to as 00, with one insane exception: The 'House Select Subcommittee on the United States Role in Iranian Arms Transfer to Croatia and Bosnia' was never actually associated with a full committee, and the Library treats it like a full committee.
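The positional breakdown above could be sketched in Python roughly as follows. This is only a structural split under the assumptions stated in this comment (two leading letters, a 2-character alphanumeric abbreviation, a 2-digit number); the class and function names are hypothetical, and no meaning is attached to the first two characters.

```python
import re
from typing import NamedTuple


class LisId(NamedTuple):
    first: str         # 1st character
    second: str        # 2nd character
    abbreviation: str  # 3rd-4th characters, alphanumeric (e.g. '2K')
    number: str        # 5th-6th characters, '00' for a full committee

    @property
    def is_full_committee(self):
        # Note: the one exception mentioned above is a select subcommittee
        # that is treated like a full committee.
        return self.number == "00"


# Two letters, two alphanumerics, two digits.
_LIS_PATTERN = re.compile(r"([A-Z])([A-Z])([A-Z0-9]{2})([0-9]{2})")


def parse_lis_id(raw):
    """Split a 6-character LIS ID into its positional parts."""
    match = _LIS_PATTERN.fullmatch(raw.strip().upper())
    if match is None:
        raise ValueError(f"not a 6-character LIS ID: {raw!r}")
    return LisId(*match.groups())


parsed = parse_lis_id("sp2k00")
print(parsed.abbreviation, parsed.is_full_committee)  # 2K True
```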

Lately, there seems to be a trend to establish new subcommittees/assign new IDs, rather than rename existing subcommittees. (Senate Agriculture and HSGAC did this a bunch of times recently -- there may be a valid legislative rationale for this, but I couldn't figure out what it was...)

konklone commented 11 years ago

This is a great breakdown, thank you. Does this break down the LIS IDs only? How do House and Senate IDs differ?

schmod commented 11 years ago

As far as I can tell, the House, Senate, and Thomas all use the LIS ID, or some truncated version thereof.

The Senate seems to like using 'SPAG' or 'SPAG00' to refer to Agriculture, and always seems to use the full 6-character LIS ID for subcommittees (i.e. 'SPAG16').

The House seems to prefer 'AG,' 'AG00,' or 'AG16' (the two-character abbreviation seems pretty rare). House sources overwhelmingly seem to prefer the 4-character 'AG00' format.

I'm still trying to figure out if there are any inconsistencies. It appears as though the scraping scripts actually use house_committee_id and thomas_id interchangeably, although the list of IDs and committees seems to originate from NYT, rather than any direct sources on House.gov.

JoshData commented 11 years ago

Which House sources?

If we can confirm that the IDs the House is currently using match up perfectly, then I'm OK with dropping house_committee_id and senate_committee_id, and renaming thomas_id to just id.
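One partial check can be sketched in Python: whether the chamber-specific IDs already in our data are derivable from thomas_id. This only tests the file's internal consistency, not what House.gov currently publishes. The function name is hypothetical, and the records are plain dicts mirroring the two YAML examples earlier in this thread.

```python
def find_inconsistencies(committees):
    """Return (thomas_id, field, value) for each chamber ID that does not match.

    Hypothetical check: senate_committee_id should equal thomas_id, and
    house_committee_id should equal the last two letters of thomas_id.
    """
    problems = []
    for committee in committees:
        thomas_id = committee.get("thomas_id", "")
        senate_id = committee.get("senate_committee_id")
        house_id = committee.get("house_committee_id")
        if senate_id is not None and senate_id != thomas_id:
            problems.append((thomas_id, "senate_committee_id", senate_id))
        if house_id is not None and house_id != thomas_id[2:]:
            problems.append((thomas_id, "house_committee_id", house_id))
    return problems


committees = [
    {"name": "Senate Committee on Appropriations",
     "thomas_id": "SSAP", "senate_committee_id": "SSAP"},
    {"name": "House Committee on Agriculture",
     "thomas_id": "HSAG", "house_committee_id": "AG"},
]
print(find_inconsistencies(committees))  # []
```

An empty list would support dropping the two extra fields; any tuple it returns is a committee that needs a closer look first.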

konklone commented 11 years ago

It'd be real nice to be able to do that; I admit to not knowing the House sources well enough to answer, without re-doing all the research @schmod is kindly doing.

JoshData commented 11 years ago

Closing due to inactivity. :)