openstates / enhancement-proposals

Open States Enhancement Proposals

Amend OSEP #4 with a plan for how we want to handle historical data. #24

Closed jamesturk closed 3 years ago

jamesturk commented 3 years ago

Problem: We want links to committee data to be somewhat consistent, so that a link to a committee on a 2021 bill will still be useful when clicked in 2022, even if the committee has since been dissolved or completely overhauled.

It might be important to keep in mind 4 potential places committee data would appear:

Case 1. On a bill page, linked from a bill action. In this case we'd really want the committee link to bring the user to the version of the committee closest to what existed at the time of the action.

Case 2. On a legislator's page. In this case we'd want to link to the current version of the committee or the last version of the committee that the legislator served on. (This distinction is actually pretty hard.) In the multi-committee approaches below this would probably be a lot messier, as each legislator would have multiple (maybe dozens of) iterations of the committee tied to them in the database.

Case 3. On a standalone committee page. In this case we'd want to be clear about which version of the committee we're showing, so that a person looking at an old version would at least know it. Multiple committee entries here will get complicated, as people may be looking at the "wrong" version and not see the data they expect.

Case 4. On a page with hearings/etc. This is similar to case 3: if a person is looking for hearings but is viewing the 2020 version of a committee, that'll be a problem for them.

Considered solutions:

A: Focus on the Present

The current approach.

Committees will not be tied to any time; at any given time, looking up a committee would provide the current members. If a committee no longer exists, it will no longer come back in API results/etc. We'd focus on the present, and be limited in what we can show for the past.

Pros:

Cons:

Use Cases: Pretty good for cases 2-4, bad for 1.

B: Tie Committees to Sessions

Committees would be tied to individual sessions. This means we'd need to re-scrape committees every session turnover (which we should already be doing) and have code to automatically expire the old ones. Old committees would be left in place but marked as expired.

Committee membership of expired committees would be frozen in time with whatever the membership was at the final date.

Pros:

Cons:

Use Cases: Pretty good for case 1, quite bad for 2-4.

C: Tie Committees to Regular Sessions via dates

A minor iteration on B: we would tie committees to regular sessions using begin & end dates on the committee.

Pros:

Cons:

Use Cases: Pretty good for case 1, good for 2 in regular sessions, bad for 2 in special sessions, still bad for 3-4.

D: Multiple Sessions Per Committee

An alternative to B & C in the same vein, but a committee would now have a list of sessions it was valid for.

Pros:

Cons:

Use Cases: Pretty good for case 1, OK for 2, still fairly bad for 3-4.
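To make option D concrete, here is a minimal sketch of a committee record that carries the list of sessions it is valid for, plus a lookup that serves case 1 (a bill action linking to the committee as it existed in that session). The `Committee` fields, the ids, and the `committee_for_session` helper are all hypothetical and are not part of any existing Open States schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Committee:
    """Hypothetical committee record for option D."""
    id: str
    name: str
    sessions: list[str]  # session identifiers this entry is valid for


def committee_for_session(
    committees: list[Committee], name: str, session: str
) -> Optional[Committee]:
    """Return the committee entry with this name that covers `session`, if any."""
    for committee in committees:
        if committee.name == name and session in committee.sessions:
            return committee
    return None


committees = [
    Committee("c1", "Education", ["2019", "2020"]),
    Committee("c2", "Education", ["2021", "2022"]),
]

# A 2020 bill action resolves to the older entry, a 2022 action to the newer one.
match_2020 = committee_for_session(committees, "Education", "2020")
match_2022 = committee_for_session(committees, "Education", "2022")
```

The lookup is only as good as the `sessions` lists, which is the maintenance burden discussed later in the thread.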

E: Year Based Committees

An alternative to the above that would use years as the boundary for committees for clarity.

Pros:

Cons:

Use Cases: Pretty good for case 1, OK but not great for 2-4?

F: Keep history on committees by member, not by having multiple entries for the same committee

A completely different approach from B-E. We could instead add a begin/end date to each membership, as well as created/dissolved dates on committees. This would allow for a few things:

We can get a point in time view of any committee. So instead of having 2020 and 2022 versions of the Education committee, we'd have one committee, and display of members would vary based upon

This was another approach Open States took (2014-2017ish). It was a ton of work, since the begin/end dates are not published, but instead we rely upon detecting when members change/etc.
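As a rough illustration of the point-in-time view in option F, the sketch below stores one record per membership with begin/end dates and filters by date. The `Membership` class, the `members_on` helper, and the sample names and dates are assumptions made up for this example; in practice the dates would have to be inferred from scrapes, as noted above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Membership:
    """One person's stint on a committee (hypothetical schema)."""
    person: str
    committee: str
    begin: date
    end: Optional[date] = None  # None means the membership is still active


def members_on(memberships: list[Membership], committee: str, when: date) -> list[str]:
    """Return the sorted names of people serving on `committee` on `when`."""
    return sorted(
        m.person
        for m in memberships
        if m.committee == committee
        and m.begin <= when
        and (m.end is None or when <= m.end)
    )


memberships = [
    Membership("Smith", "Education", date(2019, 1, 15), date(2020, 12, 31)),
    Membership("Jones", "Education", date(2019, 1, 15)),
    Membership("Lee", "Education", date(2021, 1, 12)),
]

# One committee, two different rosters depending on the date asked about.
roster_2020 = members_on(memberships, "Education", date(2020, 6, 1))
roster_2022 = members_on(memberships, "Education", date(2022, 6, 1))
```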

There's also the problem of a committee changing its name, and whether we consider that the same committee. That leads to unexpected outcomes if a committee slowly changes name/purpose over time. (Committee on Education becomes Committee on Education & Health, which is then renamed to Committee on Health as a new Committee on Education is formed, etc.)

Pros are that we can represent the true history of a committee, while cons are that working with the data becomes so cumbersome that it is nearly impossible to figure out if anything is correct & code complexity is significant.

Including this here for further discussion, but IMO the least viable option with our current plans/team.

Use Cases: If it worked, satisfies all 4 the best since complete data would be stored and we could figure out different ways to display the data for all 4. Complexity is through the roof though.

jamesturk commented 3 years ago

If people have thoughts/comments, any thinking on this would be helpful. There isn't a perfect solution IMO, so the trade-offs we pick here will have lasting repercussions depending on people's desired usage.

jessemortenson commented 3 years ago

Thanks for a thorough investigation of these issues, James. I certainly see the appeal of approach A from the implementation standpoint. Having an accurate record of committee history is a value that accumulates over time (as well as accumulating costs over time), so it's tough to swallow handling all of the complexity up front when at best that will buy an incremental value a year from now.

D, F, B are my favorite alternatives of the above.

I'm trying to think: what are the history-based insights that we might lose if we go with approach A? and can we think of ways to mitigate that loss or is it just a hard trade-off? Two examples that I've thought of so far:

Insights about people

Current committee assignments are an important source of insight about what policy areas/bills a particular legislator has influence over, and interest in. Historical assignments probably add value there as well. This overlaps with seniority on a committee (is cmte membership history the only source for seniority knowledge in some jurisdictions?). But I also wonder about times when partisan control of a chamber might cause legislators to get booted from committees, even though those legislators are still interested in (and maybe somewhat influential in) the policy area(s) of the committee.

Mitigation? If a downstream consumer of Open States has a taxonomy of policy areas and maps those to committees, then they could record the relationship between the legislator and policy areas (based on current committee data). Something like { policyArea: 'environment', source: 'committeeMembership', sourceDescription: 'Environment Committee 2020'}. That relationship could stay around even as committees are destroyed/re-ingested. So basically the tactic is persisting the inference/insight that is based on the committee data, and keeping that.
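A downstream consumer might persist that inference roughly as follows. This is only a sketch: the `PolicyAreaLink` record mirrors the fields in the example object above, while the `TAXONOMY` mapping, the `person-1` id, and the `links_from_committees` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyAreaLink:
    """Persisted inference linking a legislator to a policy area."""
    person_id: str
    policy_area: str
    source: str
    source_description: str


# Hypothetical consumer-side taxonomy mapping committee names to policy areas.
TAXONOMY = {
    "Environment Committee": "environment",
    "Education Committee": "education",
}


def links_from_committees(person_id, committee_names, label, taxonomy):
    """Derive policy-area links from a legislator's current committees."""
    return {
        PolicyAreaLink(person_id, taxonomy[name], "committeeMembership", f"{name} {label}")
        for name in committee_names
        if name in taxonomy
    }


# The store only ever grows, so links survive committees being re-ingested.
store = set()
store |= links_from_committees("person-1", ["Environment Committee"], "2020", TAXONOMY)
# Later ingest: the legislator is no longer on the Environment Committee.
store |= links_from_committees("person-1", ["Education Committee"], "2022", TAXONOMY)

areas = sorted(link.policy_area for link in store)
```

Because the links are stored separately from the committee data, dropping the 2020 committee upstream does not erase the environment link.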

Insights about legislation

Less sure about this one. If a bill in a past session was referred to committee A, and then the text of the bill is re-introduced in a current session, it could be useful to recognize that relationship to be able to predict what committee it will be referred to (or just to pre-classify the policy area of the bill).

Mitigation? Similar to above, perhaps downstream consumers could persist the relationship between the past bill and policy areas, rather than relying on a relationship between a past bill and a past committee to still exist in Open States data.

jamesturk commented 3 years ago

My struggle with D (which I otherwise think is the best option) is how we maintain the list of sessions that a committee is valid for. I'm wondering if you have any thoughts on how you'd want to see us handle that.

The two main challenges I see for D are:

1) how/where/when do we update the mapping? It ideally doesn't belong in a scraper, and there are a couple thousand committees across jurisdictions that would now need to be kept up to date manually.

2) what kinds of guidelines do we need to decide when we use a new committee instead of adding a new session to an old committee? The simplest answer would be that we typically retire all committees every regular session and create new ones, so that only special sessions need to be updated, but I wasn't sure if that was in line with what you were aiming at.

showerst commented 3 years ago

Chiming in from the peanut gallery. Full disclosure I'm not using the OS data for this, so take it as just opinions.

We run committees as one entity across sessions, but with an optional end date. We assume all committees are permanent. If a committee data-load from a particular time comes back without data for that committee, it's flagged for manual review for potential retirement. We spit out a web page with the unmatched committees and their links, and have someone go look up if it's still a live committee. If it's really a dead committee and not a data error, we just end it 12/31/previous_session. This is not perfect data but it's effective in practice, though I suppose you could also use the last session's end_date.

This is generally not hugely cumbersome, it just means a bit of work for each state when the new sessions start up. If a committee has just changed names, we alias it. Whether a "Health" -> "Health & Housing" is a continuation or a new committee is a judgement call.

The biggest downside here for OS is that if a state renames a bunch of committees, there's manual work aliasing them, and trying to keep consistent about continuations.
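The review workflow described above could look something like the sketch below: committees are assumed permanent, aliases absorb renames, and anything missing from the latest load is flagged for a person to check before it is retired. The `Committee` class, `find_unmatched`, and `retire` are hypothetical names for illustration, and retiring on December 31 of the previous session's year is one reading of "12/31/previous_session".

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Committee:
    """Hypothetical long-lived committee record with an optional end date."""
    name: str
    aliases: set[str] = field(default_factory=set)
    end_date: Optional[date] = None  # None means the committee is assumed live


def find_unmatched(known: list[Committee], scraped_names: list[str]) -> list[Committee]:
    """Return live committees whose name and aliases are all absent from the load."""
    scraped = {name.casefold() for name in scraped_names}
    unmatched = []
    for committee in known:
        if committee.end_date is not None:
            continue  # already retired
        names = {committee.name.casefold()} | {a.casefold() for a in committee.aliases}
        if not names & scraped:
            unmatched.append(committee)
    return unmatched


def retire(committee: Committee, previous_session_year: int) -> None:
    """Mark a committee as ended on Dec 31 of the previous session's year."""
    committee.end_date = date(previous_session_year, 12, 31)


known = [
    Committee("Health", aliases={"Health & Housing"}),  # renamed, kept via alias
    Committee("Budget"),
    Committee("Ethics"),
]
flagged = find_unmatched(known, ["Health & Housing", "Budget"])

# After a human confirms the flagged committee is really gone:
for committee in flagged:
    retire(committee, 2020)
```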

If I had it to do over again, I might make a committee_session one-to-many table, with an easy tool to check/uncheck sessions for a given committee, to avoid all the headaches around date maintenance.

As far as membership, I'd have a committee_person_session table for the same reason, but with an extra end_date date field for deaths/removals/etc.

There's a big 80/20 here around the accuracy -- Is the key fact that Rep. Smith was on Budget for the 2020 session, or that they started on Jan 11 and ended on May 5?

If they stopped halfway through, I think the ideal is to have the committee roster for the 2020 session show all members, but with a special section for anyone who retired/died/etc mid-session, which equates to "anybody we didn't find in this month's data that was there last month, then googled and found out they died."
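A minimal sketch of the committee_session and committee_person_session tables mentioned above, using SQLite from Python. The column names, the sample rows, and the `roster` helper are assumptions for illustration only; only the two table names come from the comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE committee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- One committee, many sessions: check/uncheck a session by adding/removing a row.
    CREATE TABLE committee_session (
        committee_id INTEGER NOT NULL REFERENCES committee(id),
        session TEXT NOT NULL,
        PRIMARY KEY (committee_id, session)
    );

    -- Membership per session; end_date is NULL unless the member left mid-session.
    CREATE TABLE committee_person_session (
        committee_id INTEGER NOT NULL REFERENCES committee(id),
        person TEXT NOT NULL,
        session TEXT NOT NULL,
        end_date TEXT,
        PRIMARY KEY (committee_id, person, session)
    );
    """
)
conn.execute("INSERT INTO committee (id, name) VALUES (1, 'Budget')")
conn.executemany("INSERT INTO committee_session VALUES (1, ?)", [("2020",), ("2021",)])
conn.executemany(
    "INSERT INTO committee_person_session VALUES (1, ?, ?, ?)",
    [("Smith", "2020", None), ("Nguyen", "2020", "2020-05-05"), ("Smith", "2021", None)],
)


def roster(conn, committee_id, session):
    """Return (full-session members, [(member, end_date)] who left mid-session)."""
    rows = conn.execute(
        "SELECT person, end_date FROM committee_person_session "
        "WHERE committee_id = ? AND session = ? ORDER BY person",
        (committee_id, session),
    ).fetchall()
    full_session = [person for person, end in rows if end is None]
    left_early = [(person, end) for person, end in rows if end is not None]
    return full_session, left_early


full_2020, left_2020 = roster(conn, 1, "2020")
```

The split return value matches the roster layout suggested above: everyone who served in the session, with a separate section for members who left partway through.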

jessemortenson commented 3 years ago

"what kinds of guidelines do we need to decide when we use a new committee instead of adding a new session to an old committee?"

I agree that's the key question. Depending on the guidelines, some automation may be possible. Brainstorming some scenarios (and what my first instinct is on how to handle it), assume same legislative chamber in all these cases:

  1. Committee name and members are exactly the same: same committee, add a new session to the old committee
  2. Committee name is exactly the same, < 50% of members have changed: same committee...
  3. Committee name is exactly the same, > 50% but less than 100% of members have changed: same committee...
  4. Committee name is exactly the same, 100% of members have changed: new committee (manual review?)
  5. Committee name shares some meaningful words, < 50% of membership has changed: same committee (manual review?)
  6. Committee name shares some meaningful words, between 50% and 100% of membership has changed: new committee (manual review?)
  7. Committee name shares some meaningful words, 100% membership has changed: new committee
  8. Committee name is different, no membership has changed: same committee (manual review?)
  9. Committee name is different, < 50% membership change: new committee
  10. Committee name is different, > 50% membership change: new committee
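The ten scenarios above can be sketched as a small rule function. Several details are assumptions not settled in the list: "meaningful words" is approximated by dropping a short hypothetical stopword list, membership change is measured as the share of old members missing from the new roster, and an exactly 50% change is treated like the ">= 50%" branch.

```python
# Hypothetical stopword list used to approximate "meaningful words".
STOPWORDS = {"committee", "on", "the", "and", "of", "&"}


def meaningful_words(name: str) -> set[str]:
    """Lowercased words of a committee name, minus filler words."""
    return {w for w in name.casefold().replace(",", " ").split() if w not in STOPWORDS}


def name_relation(old: str, new: str) -> str:
    """Classify two names as 'exact', 'partial' (shared words), or 'different'."""
    if old.casefold() == new.casefold():
        return "exact"
    if meaningful_words(old) & meaningful_words(new):
        return "partial"
    return "different"


def classify(old_name, old_members, new_name, new_members):
    """Return (decision, needs_manual_review) following scenarios 1-10."""
    total = len(old_members)
    changed = total - len(set(old_members) & set(new_members))
    all_changed = total == 0 or changed == total  # empty old roster counts as 100%
    relation = name_relation(old_name, new_name)

    if relation == "exact":
        if not all_changed:
            return ("same", False)  # scenarios 1-3
        return ("new", True)  # scenario 4
    if relation == "partial":
        if 2 * changed < total:
            return ("same", True)  # scenario 5
        if not all_changed:
            return ("new", True)  # scenario 6 (exactly 50% lands here by assumption)
        return ("new", False)  # scenario 7
    if changed == 0 and total > 0:
        return ("same", True)  # scenario 8
    return ("new", False)  # scenarios 9-10


decision = classify("Education", {"a", "b", "c"}, "Education & Health", {"a", "b", "d"})
```

Here `decision` falls under scenario 5: the names share a meaningful word and one of three members changed.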

"how/where/when do we update the mapping?"

Obviously this would carry the cost of writing more tooling, but I wonder if there's a way to do a staging process, something like:

  1. Run scrapers to ingest 2022 committees (realistically this probably happens at different points in time, since states would post new committees at different points in time), but don't yet persist them to mainline data (eg main/OS DB)
  2. Compare 2022 committees to 2021 committees (prior commit?), running a set of rules - like those above - to match a bunch of the 2022 committees to the 2021 committees.
  3. Based on that comparison, modify the 2022 scraped data such that continuing committees are merged into existing committees, no-longer-matched 2021 committees are removed, and new 2022 committees are maintained.
  4. Perform manual review on whichever are flagged as needing manual review
  5. Finally persist the merged data into the mainline (main/OS DB)
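The staging steps above might be sketched as follows, with committees reduced to a name-to-members mapping. Everything here is hypothetical: matching is by exact name only, the review rule (flag continuing committees with no overlapping members) is a stand-in for a fuller rule set, and `stage`/`persist` are illustrative names rather than existing tooling.

```python
def stage(previous: dict[str, set[str]], scraped: dict[str, set[str]]) -> dict[str, list[str]]:
    """Steps 1-3: compare a fresh scrape to the prior data without modifying either."""
    continuing = sorted(previous.keys() & scraped.keys())
    plan = {
        "continuing": continuing,
        "retired": sorted(previous.keys() - scraped.keys()),
        "created": sorted(scraped.keys() - previous.keys()),
        # Same name but an entirely new roster: ask a human before merging.
        "needs_review": [n for n in continuing if not previous[n] & scraped[n]],
    }
    return plan


def persist(main, scraped, plan, reviewed):
    """Steps 4-5: refuse to write until every flagged committee has been reviewed."""
    pending = set(plan["needs_review"]) - set(reviewed)
    if pending:
        raise ValueError(f"manual review pending for: {sorted(pending)}")
    for name in plan["retired"]:
        main.pop(name, None)
    for name in plan["continuing"] + plan["created"]:
        main[name] = set(scraped[name])
    return main


main = {"Budget": {"a", "b"}, "Ethics": {"c"}, "Rules": {"d"}}
scraped_2022 = {"Budget": {"a", "e"}, "Rules": {"x"}, "Housing": {"f"}}

plan = stage(main, scraped_2022)
# After someone has checked the flagged "Rules" committee:
main = persist(main, scraped_2022, plan, reviewed=["Rules"])
```

Keeping `stage` free of side effects means the comparison can be rerun as states post new committees at different times, with nothing reaching the mainline data until `persist` succeeds.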

Doing thousands of committees manually would be pretty painful. Even with the above there will be some manual review and manipulation. What percentage of committees do you wager carry on from session to session with the exact same name? 60%? 80%? If the number is high enough maybe we're only talking about manual review/modify of 10 committees per session per jurisdiction.

jamesturk commented 3 years ago

@jessemortenson re: automation I think that this misses the desired outcome regarding historical data that Ruby is after, but I'll let her chime in to be sure. In practice, most of the time committee membership between sessions would mostly maintain continuity (like scenarios 1-3 in your examples) but there's a desire/need to have a frozen snapshot of what the committee looked like at the end of a session, so we wouldn't want to mark them as the same committee in those cases.

And I believe it was already decided that if the name changes, it is a new committee.

jamesturk commented 3 years ago

@showerst thanks for this! In particular

"If I had it to do over again, I might make a committee_session one-to-many table, with an easy tool to check/uncheck sessions for a given committee, to avoid all the headaches around date maintenance."

is good to hear, as I think that's where we're likely headed

jessemortenson commented 3 years ago

"but there's a desire/need to have a frozen snapshot of what the committee looked like at the end of a session"

Gotcha, I think I was assuming under scenario D we weren't accounting for that need. Is the goal to have those committee-per-session snapshots maintained in Open States data (I mean in the head of main, not just in commit history)? or to provide the ability for downstream consumers to maintain them? It seems like downstream consumers could use something like a committee_person_session table as Tim mentioned to keep track of the historical snapshots of when people were on a given committee, even as that committee continues from session to session

jamesturk commented 3 years ago

After some deliberations it has been decided that for now we'll focus on the present, and see what use cases arise that necessitate other options. So essentially option A. We'll have git history to rebuild the past if we want historical data, and Open States will hide legacy committees (in API/os.org) by default, but they may resurface later.