unitedstates / BillMap

Utilities and applications for the FlatGov project by Demand Progress

Update the definition of 'identical' bills to encompass 'companion or antecedent' bills #465

Open aih opened 3 years ago

aih commented 3 years ago

As discussed today on the call, the definition of 'identical' bills, used in the context section, should be narrower than it is now.

It should include: the bill itself, the companion bill (in the other chamber), bills identified by CRS as 'identical', and 'antecedent legislation'.

Our current categories in the BillMap project do not easily map to 'antecedent' legislation. We find bills that are related in the following categories:

  1. Identified by CRS (CRS does not identify bills from previous congresses)
  2. Similar sections (this identifies related and included legislation; it leads to false positives for antecedent bills)
  3. Nearly identical text (this may identify antecedent legislation, but small changes between Congresses may mean we do not consider the bill 'nearly identical')
  4. Same title (we do not distinguish between the short title of the bill and titles of smaller portions of the bill). This allows us to identify related legislation but would lead to false positives if we use it to identify antecedent bills.

To identify antecedent legislation, Josh considers bills that have the same title for the whole bill and the same sponsor. That is a more accurate measure, but it would take significant additional effort to implement in BillMap.
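A rough sketch of that title-plus-sponsor check, in Python. The `Bill` record, its field names, and `find_antecedents` are hypothetical and are not BillMap's actual data model. The sketch also assumes titles match exactly after normalizing case and whitespace, so a title that changes between Congresses (for example, one that embeds a year) would be missed.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Bill:
    # Hypothetical record; field names are assumptions for illustration.
    congress: int        # e.g. 117
    number: str          # e.g. "hr100"
    bill_title: str      # title for the whole bill, not a title of a smaller portion
    sponsor: str         # any stable sponsor identifier


def normalize_title(title: str) -> str:
    """Lowercase the title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def find_antecedents(bill: Bill, candidates: Iterable[Bill]) -> List[Bill]:
    """Return candidates from earlier Congresses with the same whole-bill
    title and the same sponsor as `bill`."""
    wanted_title = normalize_title(bill.bill_title)
    return [
        c
        for c in candidates
        if c.congress < bill.congress
        and c.sponsor == bill.sponsor
        and normalize_title(c.bill_title) == wanted_title
    ]
```

Restricting the match to `congress < bill.congress` keeps companion bills in the current Congress out of the antecedent list; those would still come from the CRS category.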

aih commented 3 years ago

PR #466 takes a step toward narrowing the definition, by only including bills that we identify as 'identical' and 'nearly identical'. As discussed above, that will eliminate most bills from previous congresses from the list of 'identical' bills.

DanielSchuman commented 3 years ago

Here's my best guess for what might help to make antecedent work to include "this bill but from the prior Congress":

  1. Identified by CRS (so we get the companion bill in the current Congress)
     1b. For companion bills that CRS has identified, we may also wish to run the query for their antecedents. (And only look at prior Congresses so we don't create a recursive loop. :) )

  2. Nearly identical text (should help us go across multiple Congresses, but will miss some, but that's okay)

  3. Same short title -- not additional titles. This should help us go across multiple Congresses. It will likely lead to some false positives, but hopefully not too many.
     (a) Josh identified an approach of checking whether the first two identified sponsors are the same as the first two identified sponsors of the previous legislation. That would be very precise, but it could entail significant effort.
     (b) Perhaps a less intensive effort would be to compare the size of the bills. You could look at the file size or the number of characters or words in a bill. If two bills are within, say, 10% of each other in size, then the earlier one is more likely to be an antecedent. This would do a good job of excluding cases where the bill we are examining is included in a much larger bill, which is fine by me for the purposes of identifying an antecedent. To narrow it further, you could also require at least one section to match.

Is this a useful approach?
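To make the size comparison in 3(b) concrete, here is a minimal sketch in Python. The function names, the use of character counts, and the optional `shared_sections` argument are assumptions for illustration only; nothing here reflects existing BillMap code, and the 10% tolerance is just the "say, 10%" figure from above.

```python
from typing import Optional


def relative_size_difference(text_a: str, text_b: str) -> float:
    """Size difference as a fraction of the larger text's character count."""
    len_a, len_b = len(text_a), len(text_b)
    larger = max(len_a, len_b)
    if larger == 0:
        return 0.0  # two empty texts are treated as the same size
    return abs(len_a - len_b) / larger


def is_likely_antecedent(
    bill_text: str,
    candidate_text: str,
    tolerance: float = 0.10,
    shared_sections: Optional[int] = None,
) -> bool:
    """Apply the size check, plus the optional 'at least one section
    matches' check when a count of matching sections is supplied."""
    if relative_size_difference(bill_text, candidate_text) > tolerance:
        return False
    if shared_sections is not None and shared_sections < 1:
        return False
    return True
```

This would be applied only to candidates that already share a short title with the bill and come from a prior Congress. Measuring the difference against the larger text means a bill folded into a much larger package fails the check regardless of which of the two is passed first.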
