Amend OSEP #4 with a plan for how we want to handle historical data.

Problem: We want links to committee data to be somewhat consistent, so that a link to a committee on a 2021 bill will be useful when clicked in 2022 despite the committee potentially having been dissolved or completely overhauled.

It might be important to keep in mind 4 potential places committee data would appear:

Case 1. On a bill page, linked from a bill action. In this case we'd really want the committee link to bring the user to the version of the committee closest to what existed at the time of the action.
Case 2. On a legislator's page. In this case we'd want to link to the current version of the committee or the last version of the committee that the legislator served on. (This distinction is actually pretty hard.) In the multi-committee approaches below this would probably be a lot messier, as they'd have multiple (maybe dozens of) iterations of the committee tied to them in the database.
Case 3. On a standalone committee page. In this case we'd want to be clear about which version of the committee we're showing, so that if a person is looking at an old version, that is at least clear. Multiple committee entries here will get complicated, as people may be looking at the "wrong" version and not see the data they expect.
Case 4. On a page with hearings/etc. This is similar to 3: if a person is looking for hearings but looking at the 2020 version of a committee, that'll be a problem for them.

Considered solutions:

A: Focus on the Present

The current approach.

Committees will not be tied to any time; at any given time, looking up a committee would provide the current members. If a committee no longer exists, it will no longer come back in API results/etc. We'd focus on the present, and be limited in what we can show for the past.

Pros:

Easiest to implement: essentially scrape committees, throw away old data, update with new. (history will live in git though, so we could eventually mine that for more info)
Less confusing, no duplicate copies of same committee with minor changes.
Things like bookmarking a committee page will always link to the correct data.

Cons:

No sense of historical committees & memberships.
Old links could break, or at least point to data that has gone missing.
Bill that links to Committee A in 2021 will not have good data when viewed in 2022.

Use Cases: Pretty good for cases 2-4, bad for 1.

B: Tie Committees to Sessions

Committees would be tied to individual sessions. This means we'd need to re-scrape committees every session turnover (which we should already be doing) and have code to automatically expire the old ones. Old committees would be left in place but marked as expired.

Committee membership of expired committees would be frozen in time with whatever the membership was at the final date.
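A minimal sketch of what the session turnover could look like under this option. The `Committee` class, its fields, and the `roll_over` helper are hypothetical names for illustration, not existing Open States code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Committee:
    name: str
    session: str            # hypothetical session identifier, e.g. "2021"
    members: List[str]
    expired: bool = False   # expired committees stay in place, frozen


def roll_over(existing: List[Committee],
              scraped: Dict[str, List[str]],
              new_session: str) -> List[Committee]:
    """Expire committees from earlier sessions and add freshly scraped ones."""
    for committee in existing:
        if committee.session != new_session:
            # Membership is left untouched: it stays frozen at its final state.
            committee.expired = True
    fresh = [Committee(name, new_session, list(members))
             for name, members in scraped.items()]
    return existing + fresh
```

The old entries are never deleted or rewritten, which is where the N*S growth in stored committees mentioned below comes from.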

Pros:

committee membership will be mostly accurate, major changes typically don't happen during a session, so only minimal historical data will be lost
1:1 ratio of session to committee set is clear & easy to explain compared to several of the options below

Cons:

states do not tie committee information to session, so these connections would be made in a semi-automated way based on when we scraped the data, this can be error prone if they aren't updating their committees online in a timely manner
instead of N committees per jurisdiction, we'd need N*S where S is the number of sessions. That is more entities to track/more data to worry about.
special sessions complicate this a great deal, since some overlap regular sessions and some don't, and they can be very short and very numerous in some states
we tried this before with Open States (2011-2014 or so IIRC) & because it is too much data for manual review, it left a lot of kind of incorrect data in older sessions & there was no way we could go back & correct them
a little confusing when displayed on website ("Senate Education Committee (2021 Session)")

Use Cases: Pretty good for case 1, quite bad for 2-4.

C: Tie Committees to Regular Sessions via dates

A minor iteration on B, we would tie committees to regular sessions using begin & end dates on the committee.

Pros:

accuracy is still pretty good like in B

Cons:

uncertain how we'd handle committees outside of regular session dates, would special sessions just not get committee data? (perhaps a problem in states with important/long-running special sessions)
still a lot of data to keep track of, and the linkage between committees & sessions is now more opaque since it uses dates instead of a direct relationship
how we'd access this via the API is unclear, we typically don't use dates like this, instead preferring session identifiers.
hard to display on website "Senate Education Committee (January 3rd 2021-December 6th 2021)"?

Use Cases: Pretty good for case 1, good for 2 in regular sessions, bad for 2 in special sessions, still bad for 3-4.

D: Multiple Sessions Per Committee

An alternative to B & C in the same vein, but a committee would now have a list of sessions it was valid for.

Pros:

fewer duplicate committees since a committee could be tied to a session & special sessions between the same election
clear linkage between committee & what sessions it is valid for

Cons:

this data can't be scraped, so it would be a judgment call on which were valid
setting this data manually could become quite confusing, if a committee is made up of the same members someone would be tempted to just add '2022' to valid sessions list, but the entire purpose of separating the committees is to keep 2021 committees from changing as years go on
hard to display on website "Senate Education Committee (2021 Regular Session, 2021B Special Session, 2021C Budget Session)"?

Use Cases: Pretty good for case 1, OK for 2, still fairly bad for 3-4.

E: Year Based Committees

An alternative to the above that would use years as the boundary for committees for clarity.

Pros:

easy to explain & display "Senate Education Committee (2021)"
automation would be easy since a year is a year, no manual curation of list of sessions/etc.

Cons:

would lead to occasional weirdness when session dates are outside normal year boundaries
only data we'd track that would be year based, could be confusing if people didn't understand nuanced difference between session & year
still a lot of data to track

Use Cases: Pretty good for case 1, OK but not great for 2-4?

F: Keep history on committees by member, not by having multiple entries for the same committee

A completely different approach from B-E. We could instead add a begin/end date to each membership, as well as created/dissolved dates on committees. This would allow for a few things:

We can get a point-in-time view of any committee. So instead of having 2020 and 2022 versions of the Education committee, we'd have one committee, and the display of members would vary based upon the date being viewed.
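A rough sketch of the point-in-time lookup, assuming each membership carries a start date and an optional end date. The `Membership` class and `members_on` function are hypothetical names, and the dates would have to be inferred rather than scraped.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Membership:
    person: str
    start: date
    end: Optional[date] = None  # None means the person is still serving


def members_on(memberships: List[Membership], when: date) -> List[str]:
    """Return the sorted names of everyone serving on the given date."""
    return sorted(
        m.person
        for m in memberships
        if m.start <= when and (m.end is None or when <= m.end)
    )


# Illustrative data only; names and dates are made up.
education = [
    Membership("Smith", date(2020, 1, 8), date(2021, 12, 31)),
    Membership("Jones", date(2020, 1, 8)),
    Membership("Lee", date(2022, 1, 10)),
]
```

With this data, `members_on(education, date(2020, 6, 1))` would list Jones and Smith, while the same call for mid-2022 would list Jones and Lee.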

This was another approach Open States took (2014-2017ish). It was a ton of work, since the begin/end dates are not published, but instead we rely upon detecting when members change/etc.

There's also the problem of a committee changing its name and whether or not we consider that the same committee or not. That leads to unexpected outcomes if a committee slowly changes name/purpose over time. (Committee on Education becomes committee on Education & Health, which is then renamed to Committee on Health as a new committee on education is formed, etc.)

Pros are that we can represent the true history of a committee, while cons are that working with the data becomes so cumbersome that it is nearly impossible to figure out if anything is correct & code complexity is significant.

Including this here for further discussion, but IMO the least viable option with our current plans/team.

Use Cases: If it worked, satisfies all 4 the best since complete data would be stored and we could figure out different ways to display the data for all 4. Complexity is through the roof though.

If people have thoughts/comments, any thinking on this would be helpful. There isn't a perfect solution IMO, so the trade-offs we pick here will have lasting repercussions depending on people's desired usage.

Thanks for a thorough investigation of these issues, James. I certainly see the appeal of approach A from the implementation standpoint. Having an accurate record of committee history is a value that accumulates over time (as well as accumulating costs over time), so it's tough to swallow handling all of the complexity up front when at best that will buy an incremental value a year from now.

D, F, B are my favorite alternatives of the above.

I'm trying to think: what are the history-based insights that we might lose if we go with approach A? and can we think of ways to mitigate that loss or is it just a hard trade-off? Two examples that I've thought of so far:

Insights about people

Current committee assignments are an important source of insight about what policy areas/bills a particular legislator has influence over, and interest in. Historical assignments probably add value there as well. This overlaps with seniority on a committee (is cmte membership history the only source for seniority knowledge in some jurisdictions?). But I also wonder about times when partisan control of a chamber might cause legislators to get booted from committees, even though those legislators are still interested in (and maybe somewhat influential in) the policy area(s) of the committee.

Mitigation? If a downstream consumer of Open States has a taxonomy of policy areas and maps those to committees, then they could record the relationship between the legislator and policy areas (based on current committee data). Something like { policyArea: 'environment', source: 'committeeMembership', sourceDescription: 'Environment Committee 2020'}. That relationship could stay around even as committees are destroyed/re-ingested. So basically the tactic is persisting the inference/insight that is based on the committee data, and keeping that.

Insights about legislation

Less sure about this one. If a bill in a past session was referred to committee A, and then the text of the bill is re-introduced in a current session, it could be useful to recognize that relationship to be able to predict what committee it will be referred to (or just to pre-classify the policy area of the bill).

Mitigation? Similar to above, perhaps downstream consumers could persist the relationship between the past bill and policy areas, rather than relying on a relationship between a past bill and a past committee to still exist in Open States data.

My struggle with D (which I otherwise think is the best option) is how we maintain the list of sessions that a committee is valid for. I'm wondering if you have any thoughts on how you'd want to see us handle that.

The two main challenges I see for D are:

1) how/where/when do we update the mapping? It ideally doesn't belong in a scraper, and there are a couple thousand committees across jurisdictions that'll need to now be manually kept up to date.
2) what kinds of guidelines do we need to decide when we use a new committee instead of adding a new session to an old committee? The simplest answer would be that we typically retire all committees every regular session & create new ones, and then only special sessions need to be updated, but I wasn't sure if that was in line with what you were aiming at.

Chiming in from the peanut gallery. Full disclosure I'm not using the OS data for this, so take it as just opinions.

We run committees as one entity across sessions, but with an optional end date. We assume all committees are permanent. If a committee data-load from a particular time comes back without data for that committee, it's flagged for manual review for potential retirement. We spit out a web page with the unmatched committees and their links, and have someone go look up if it's still a live committee. If it's really a dead committee and not a data error, we just end it 12/31/previous_session. This is not perfect data but it's effective in practice, though I suppose you could also use the last session's end_date.
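Roughly, the review step looks like the sketch below. The function names and the dict-of-links shape are made up for illustration and are not our actual code.

```python
from datetime import date
from typing import Dict, List, Set, Tuple


def retirement_candidates(known: Dict[str, str],
                          loaded: Set[str]) -> List[Tuple[str, str]]:
    """Committees we consider live that were absent from the latest data load.

    known maps committee name -> link for a human to check;
    loaded is the set of committee names in the latest load.
    """
    return sorted((name, url) for name, url in known.items()
                  if name not in loaded)


def retirement_date(previous_session_year: int) -> date:
    """End a confirmed-dead committee on 12/31 of the previous session's year."""
    return date(previous_session_year, 12, 31)
```

The list from `retirement_candidates` is what would be rendered as the web page for manual lookup; nothing is retired automatically.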

This is generally not hugely cumbersome, it just means a bit of work for each state when the new sessions start up. If a committee has just changed names, we alias it. Whether a "Health" -> "Health & Housing" is a continuation or a new committee is a judgement call.

The biggest downside here for OS is that if a state renames a bunch of committees, there's manual work aliasing them, and trying to keep consistent about continuations.

If I had it to do over again, I might make a committee_session one to many table, with an easy tool to check/uncheck sessions for a given committee, to avoid all the headaches around date maintenance.

As far as membership, I'd have a committee_person_session table for the same reason, but with an extra end_date date field for deaths/removals/etc.
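As a sketch of what those two tables could look like, using Python's built-in sqlite3 so it runs anywhere. Only the table names and the end_date field come from the description above; every other column name and the `roster` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE committee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Which sessions a committee is valid for (check/uncheck = insert/delete).
CREATE TABLE committee_session (
    committee_id INTEGER NOT NULL REFERENCES committee(id),
    session TEXT NOT NULL,
    PRIMARY KEY (committee_id, session)
);

-- Who served on a committee in a session; end_date stays NULL
-- unless the member left mid-session (death, removal, etc.).
CREATE TABLE committee_person_session (
    committee_id INTEGER NOT NULL REFERENCES committee(id),
    person TEXT NOT NULL,
    session TEXT NOT NULL,
    end_date TEXT,
    PRIMARY KEY (committee_id, person, session)
);
""")

# Illustrative rows only.
conn.execute("INSERT INTO committee (id, name) VALUES (?, ?)", (1, "Budget"))
conn.execute("INSERT INTO committee_session VALUES (?, ?)", (1, "2020"))
conn.executemany(
    "INSERT INTO committee_person_session VALUES (?, ?, ?, ?)",
    [(1, "Smith", "2020", "2020-05-05"), (1, "Jones", "2020", None)],
)


def roster(db: sqlite3.Connection, committee_id: int, session: str):
    """Return (person, end_date) rows for one committee in one session."""
    return db.execute(
        "SELECT person, end_date FROM committee_person_session "
        "WHERE committee_id = ? AND session = ? ORDER BY person",
        (committee_id, session),
    ).fetchall()
```

Rows with a non-NULL end_date are the ones that would go in the separate "left mid-session" section of a roster.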

There's a big 80/20 here around the accuracy -- Is the key fact that Rep. Smith was on Budget for the 2020 session, or that they started on Jan 11 and ended on May 5?

If they stopped halfway through, I think the ideal is to have the committee roster for the 2020 session show all members, but with a special section for anyone who retired/died/etc mid-session, which equates to "anybody we didn't find in this month's data that was there last month, then googled and found out they died."

what kinds of guidelines do we need to decide when we use a new committee instead of adding a new session to an old committee?

I agree that's the key question. Depending on the guidelines, some automation may be possible. Brainstorming some scenarios (and what my first instinct is on how to handle it), assume same legislative chamber in all these cases:

Committee name and members are exactly the same: same committee, add a new session to the old committee
Committee name is exactly the same, < 50% of members have changed: same committee...
Committee name is exactly the same, > 50% but less than 100% of members have changed: same committee...
Committee name is exactly the same, 100% of members have changed: new committee (manual review?)
Committee name shares some meaningful words, < 50% of membership has changed: same committee (manual review?)
Committee name shares some meaningful words, between 50% and 100% of membership has changed: new committee (manual review?)
Committee name shares some meaningful words, 100% membership has changed: new committee
Committee name is different, no membership has changed: same committee (manual review?)
Committee name is different, < 50% membership change: new committee
Committee name is different, > 50% membership change: new committee
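To make the scenarios above concrete, here is a rough sketch of how they could be encoded. Everything in it is an assumption for illustration: the stopword list, measuring "membership changed" as one minus the overlap between the two rosters, and treating the exactly-50% case with an identical name (which the list leaves open) as the same committee.

```python
from typing import Iterable, Tuple

# Hypothetical list of words that do not count as "meaningful".
STOPWORDS = {"committee", "on", "the", "and", "of", "&", "senate", "house"}


def name_relation(old: str, new: str) -> str:
    """Classify two names as 'exact', 'similar' (shared words) or 'different'."""
    if old.strip().lower() == new.strip().lower():
        return "exact"
    old_words = set(old.lower().split()) - STOPWORDS
    new_words = set(new.lower().split()) - STOPWORDS
    return "similar" if old_words & new_words else "different"


def fraction_changed(old: Iterable[str], new: Iterable[str]) -> float:
    """0.0 for identical rosters, 1.0 when no member is shared."""
    old_set, new_set = set(old), set(new)
    union = old_set | new_set
    if not union:
        return 0.0
    return 1 - len(old_set & new_set) / len(union)


def classify(old_name, old_members, new_name, new_members) -> Tuple[str, bool]:
    """Return ('same' or 'new', needs_manual_review)."""
    relation = name_relation(old_name, new_name)
    changed = fraction_changed(old_members, new_members)
    if relation == "exact":
        return ("same", False) if changed < 1.0 else ("new", True)
    if relation == "similar":
        if changed < 0.5:
            return ("same", True)
        return ("new", True) if changed < 1.0 else ("new", False)
    # Different name: only an unchanged roster suggests a continuation.
    return ("same", True) if changed == 0.0 else ("new", False)
```

Only the cases that return `True` in the second position would need a human to look at them.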

how/where/when do we update the mapping?

Obviously this would carry the cost of writing more tooling, but I wonder if there's a way to do a staging process, something like:

Run scrapers to ingest 2022 committees (realistically this probably happens at different points in time, since states would post new committees at different points in time), but don't yet persist them to mainline data (eg main/OS DB)
Compare 2022 committees to 2021 committees (prior commit?), running a set of rules - like those above - to match a bunch of the 2022 committees to the 2021 committees.
Based on that comparison, modify the 2022 scraped data such that continuing committees are merged into existing committees, no-longer-matched 2021 committees are removed, and new 2022 committees are maintained.
Perform manual review on whichever are flagged as needing manual review
Finally persist the merged data into the mainline (main/OS DB)
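The comparison step in the middle of that process could be sketched as below. The function name, the roster shape, and the pluggable `same_committee` rule are all hypothetical; manual review and the final persist step are left out.

```python
from typing import Callable, Dict, List

Roster = Dict[str, List[str]]  # committee name -> member names
Rule = Callable[[str, List[str], str, List[str]], bool]


def stage_committees(previous: Roster, scraped: Roster,
                     same_committee: Rule) -> dict:
    """Match newly scraped committees against the prior set without persisting.

    same_committee(old_name, old_members, new_name, new_members) decides
    whether a scraped committee continues an existing one.
    """
    unmatched_old = dict(previous)
    continued, created = [], []
    for new_name, new_members in scraped.items():
        match = next(
            (old_name for old_name, old_members in unmatched_old.items()
             if same_committee(old_name, old_members, new_name, new_members)),
            None,
        )
        if match is not None:
            del unmatched_old[match]            # each old committee matches once
            continued.append((match, new_name))
        else:
            created.append(new_name)
    return {
        "continued": continued,            # merge into the existing committee
        "created": created,                # keep as new committees
        "removed": sorted(unmatched_old),  # prior committees with no match
    }
```

A deliberately naive rule such as `lambda old, om, new, nm: old == new` is enough to try it out; a real rule set would be richer and would also flag items for review.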

Doing thousands of committees manually would be pretty painful. Even with the above there will be some manual review and manipulation. What percentage of committees do you wager carry on from session to session with the exact same name? 60%? 80%? If the number is high enough maybe we're only talking about manual review/modify of 10 committees per session per jurisdiction.

@jessemortenson re: automation I think that this misses the desired outcome regarding historical data that Ruby is after, but I'll let her chime in to be sure. In practice, most of the time committee membership between sessions would mostly maintain continuity (like scenarios 1-3 in your examples) but there's a desire/need to have a frozen snapshot of what the committee looked like at the end of a session, so we wouldn't want to mark them as the same committee in those cases.

And I believe it was already decided that if the name changes, it is a new committee.

@showerst thanks for this! In particular

"If I had it to do over again, I might make a committee_session one to many table, with an easy to tool to check/uncheck sessions for a given committee, to avoid all the headaches around date maintenance."

is good to hear, as I think that's where we're likely headed

but there's a desire/need to have a frozen snapshot of what the committee looked like at the end of a session

Gotcha, I think I was assuming under scenario D we weren't accounting for that need. Is the goal to have those committee-per-session snapshots maintained in Open States data (I mean in the head of main, not just in commit history)? or to provide the ability for downstream consumers to maintain them? It seems like downstream consumers could use something like a committee_person_session table as Tim mentioned to keep track of the historical snapshots of when people were on a given committee, even as that committee continues from session to session

After some deliberations it has been decided that for now we'll focus on the present, and see what use cases arise that necessitate other options. So essentially option A. We'll have git history to rebuild the past if we want historical data, and Open States will hide legacy committees (in API/os.org) by default, but they may resurface later.
