openstates / enhancement-proposals

Open States Enhancement Proposals
1 stars 3 forks source link

OSEP #10 #40

Closed sroomf closed 1 year ago

sroomf commented 1 year ago

Creating an Open States Enhancement Proposal for updating action classification. In summary, we would like to move all code related to action classification into its own actions.py for each jurisdiction. The EP contains details and justification for this change.

sroomf commented 1 year ago

cc @johnseekins @bfossen-ce your thoughts are appreciated!

jamesturk commented 1 year ago

:wave: @showerst has done a lot of the heavy lifting on action classifiers & will be able to weigh in on places where the action text alone isn't enough to go on. That isn't to say this proposal doesn't move things in the right direction, just that it may need to encompass some other fields/data.

As a minor, related, nit, I'd perhaps consider altering the structure of your proposed data structure from

_actions = {
    "bill action phrase": {
        "type": "string comparison or regex", 
        "mappings": ["OS classification mapping"]
        },
    "read the second time": {
        "type": "compare", 
        "mappings": ["reading-2"]
        },
    "^introduced": {
        "type": "regex", 
        "mappings": ["introduction"]
        },
    }

To

actions = [
 {"type": "compare", "value": "read the second time", "mappings": ["reading-2"]},
 {"type": "regex", "value": "^introduced", "mappings": ["introduction"]},
]

For flexibility, or take it a step further even and introduce ClassifierRule classes:

actions = [ 
   ClassifierRule(regex=r"^introduced", mappings=["introduction"]),
   ClassifierRule(string=r"read the second time", mappings=["introduction"]),
]

Which would allow for flexibility down the line.

Either way, glad to see this happening!

showerst commented 1 year ago

This looks like a great idea! It would be wonderful to have all that messy old code standardized.

Edge cases to think about:

  1. In some jurisdictions (CO, for example) we parse the actor (chamber or executive) out based on the action text. We also (rarely?) pull out bill IDs, I think for identifying substitutes and companions. You also need to preserve this ability because the executive's actions get dumped in various places by various states, and you don't want to accidentally classify "governor signed" as being performed by a chamber. I suggest allowing the regexes to return capture groups.
  2. In many jurisdictions the actor isn't part of the action string, but we know the actor from something else in the scrape (usually the page structure). Are there are any cases where we have to pass it into the classifier? Like where some string has one meaning in the House but another in the Senate? I don't think so but we should confirm.
  3. This is something big to keep in mind if you're thinking of moving classification into a totally separate job step down the road, you'll need that metadata that isn't always in the action text. I'm not sure of the best way to handle this, maybe just let the matching function take an optional actor.
  4. In some states (AZ, KS, GA) action classifications happen based on the structure of some data object they provide from an API. They don't have a list of action strings. I don't think that's a problem, I just want to flag that all 53 jurisdictions won't fit into a purely string-matching process.
  5. If you take a look at NJ we're getting a crazy dict and renaming the actions to be human readable. I think NM does this too. We might want a search/replace or rename feature.
  6. Whoever rewrites these, be mindful of the order of the actions in the existing lists. In many states the matching code will break after it finds a match, which stops it from matching on something else further down the list and adding the wrong classification. But I'm not sure every state does that :grimacing:. Some kind of test harness to make sure rewritten results match the old code is advisable. I can't find any commits now but I feel like this has bitten us in the past.
  7. In addition to ordering, I know at least NH and probably other states have cases where they purposely short-circuit the action classifier to stop it from mis-classifying. This should probably just be done by rewriting the matching regexes to not match bad cases.
  8. South Dakota is a bit of a disaster , with combined string and data structure checks. The US scraper also has a two stage data-then-text classifier.
sroomf commented 1 year ago

Thank you for your fantastic suggestions/comments/insights @showerst and @jamesturk ! Super helpful 👍

Tim, I can definitely see how detailed and varied these edge case jurisdictions are. I imagine the categorize_actions()function will probably look slightly different from jurisdiction to jurisdiction, depending on how action classification needs to be handled for each one.

sroomf commented 1 year ago

For flexibility, or take it a step further even and introduce ClassifierRule classes:

actions = [ 
   ClassifierRule(regex=r"^introduced", mappings=["introduction"]),
   ClassifierRule(string=r"read the second time", mappings=["introduction"]),
]

Which would allow for flexibility down the line.

@jamesturk, would you be able to elaborate on what the ClassifierRule class would look like? I haven't worked with classes in this way before!

jamesturk commented 1 year ago

Glad to!

What I was thinking was that there could be a very simple interface like this: (this isn't quite right, but hopefully gets the point across)

class ClassifierRule(ABC):
     @abstractmethod
     def classify_action(self, action) -> set[str]:
            pass     

And then:

class RegexRule(ClassifierRule):
    def __init__(self, regex, mappings):
          self.regex = regex
          self.mappings = set(mappings)

    def classify_action(self, action):
          if self.regex.match(action.text):
              return self.mappings
          else:
              return set()

class CustomRule(ClassifierRule):
    def classify_action(self, action):
          # let's say a state had custom numeric codes per action 
          # (I believe NM?)  this method could then just look those 
          # codes up in a dictionary and return those

Then, you'd have a list of ClassifierRule subclasses, most/all might be RegexRule (or PrefixRule or whatever), but you'd have the option to incorporate/run custom code per action.

You'd apply them something like:

for action in actions:
    classification = set()
    for rule in state.action_classifiers:
         classification |= rule.classify_action(action)
    action.classification = list(classification)

This could be extended to also have a way to set whether or not once a rule matches it should halt rule processing.

Hope this is more clear, let me know if I can clarify anything.