Open States Enhancement Proposals

Structured Chapter and Legal Citations #9

Closed showerst closed 3 years ago

showerst commented 3 years ago

I think this is ready for a first review pass. @jessemortenson note that this should also address our 'what to do with chapter actions in MN' issue.

jessemortenson commented 3 years ago

Very interesting, Tim, I appreciate your work on this. At CE we've done a brief exploratory project on legal references, through the lens of parsing them from bill text, and it seems to have promise. I agree it would be a valuable addition to the OS data model. You've identified a lot of the concerns that came up when we looked at it.

In our experiment we also made a light attempt at structuring the reference text strings around a pattern like [type_of_reference]:[reference_strings_from_general_to_specific]/[document_referenced], e.g.:

laws:chapter 95, article 1, section 11, subdivision 7
statutes:201.014/Minnesota Statutes 2018
statutes:201.071, subdivision 1/Minnesota Statutes 2018

That's obviously more work, and we didn't get far enough to prove whether the additional work was worth it. Pushing that kind of parsing to downstream consumers probably does make sense, especially if sufficient context is captured at the Open States level.
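As a rough sketch of what downstream parsing of that shape could look like: the function and field names below are hypothetical, and assume the illustrative [type]:[refs]/[document] pattern above, not a finalized spec.

```python
def parse_reference(raw: str) -> dict:
    """Split a typed reference string of the shape
    [type]:[refs_general_to_specific]/[document] into parts.
    Field names are illustrative, not a finalized schema."""
    ref_type, _, rest = raw.partition(":")
    refs, _, document = rest.partition("/")
    return {
        "type": ref_type,
        # components run from general to specific
        "components": [part.strip() for part in refs.split(",")],
        "document": document or None,
    }

parse_reference("statutes:201.071, subdivision 1/Minnesota Statutes 2018")
# → {'type': 'statutes',
#    'components': ['201.071', 'subdivision 1'],
#    'document': 'Minnesota Statutes 2018'}
```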

My main questions are:

  1. Should there be any standard for the granularity of the references we pull in? I'm guessing most of the references you pull from that MN example were pulled from the actual bill text, right? But that bill text also has lots of more specific references ("Minnesota Statutes 2018, section 31A.10", "Minnesota Statutes 2019 Supplement, section 223.16, subdivision 4", "Minnesota Statutes 2018, section 41D.01" just to grab some randomly in "XXX is amended to read" lines). There are also references within the bill to other sections of law that are relevant to the bill (like, "Loans under this program will be made using money in the revolving loan account established under section 41B.06"). Is it cool to just ingest as much of this as we want, on a jurisdiction-by-jurisdiction basis? Or are there different meanings to those different levels of specificity/immediacy, so we need to establish at least a weak understanding of the legal domain in each jurisdiction to follow?
  2. I agree that it seems like extracting references from bill text itself would yield a lot of the potential value in this feature, for the reasons you state. Yet loading additional concerns into bill scrapers carries the risk of introducing additional bugs that could cause a scraper to fail, or having to make extra requests (to fetch and parse the text - unless you're thinking this new method only gets added to the text extraction process rather than bills.py scraper?), or even just adding computational load regexing text. Feels like a data processing task that should be decoupled from the scraping task (remote fetches). To me this points towards the need for a multi-step ingestion process - does that fit in with how you're thinking of running scraper code, or are you feeling OK with continuing to add processing logic into the bill scrapers?
showerst commented 3 years ago

I'm responding in reverse order here, because I think the response to the second question helps explain the response to the first.

I agree that it seems like extracting references from bill text itself would yield a lot of the potential value in this feature...

I strongly agree with everything you've said here.

My goal would be that we concentrate on the cases where the jurisdictions offer the legal data as metadata on bill pages we're already grabbing, or could get easily (e.g. one page load for the MN chaptered laws index, and a few lines of code to map that back to bills). I haven't done an exhaustive survey here, but I think that gets us at least 10 jurisdictions, probably more.

For all the normal reasons I'm strongly against adding a bill text downloader/parser to every jurisdiction's bill scrape. This would be more along the lines of an additional XPath expression in bills.py, or just converting existing extras code over to something structured. A number of states put all the references in the summary as well, which we're already collecting in situ.

FWIW if OS or one of the commercial entities wants to add in a post processing pipeline after we've plaintexted the bills that would be great, but speaking for GovHawk at least that's not the immediate intent. I see it as more of a handy side-effect of making this data scrapeable at a bill-metadata level.

To me this data resembles individual roll-call votes, where our policy has traditionally been "We will parse this if the state makes it available in a scrapeable format, OR someone will really commit to building and maintaining a PDF parser", rather than things like bill versions where 52 jurisdiction coverage is a project essential, at any cost in code complexity or scraper runtime.

Should there be any standard for the granularity of the references we pull in?

You're right that I pulled those MN refs out by hand.

My preference here would be to leave this to the scraper writers -- Some jurisdictions give you a nice itemized list of various depths, and for some it's just the jumble of textual references.

My thinking is that a first pass catches the low-hanging fruit like "Minnesota Statutes 2018, section 31". That's valuable on its own, and then if someone has a use case that justifies the complex work of parsing every subsection and spanning reference to turn that into a list of 37 subsection references, we don't need to make any data model changes. As long as that code doesn't absolutely obliterate the scraper's maintainability, it's still a win for us. If someone steps up to do that, we can have a longer discussion with them about moving the complex parts to a package or utilities file or something.
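As a sketch of what such a first pass might look like, a single regex can catch the simple "Minnesota Statutes ..., section ..." forms from the examples earlier in this thread. The pattern here is illustrative, not production-ready; a real scraper would tune it (and its edge cases) per jurisdiction.

```python
import re

# Illustrative first-pass pattern for the "easy" Minnesota-style citations:
# an optional year, an optional "Supplement" marker, a section number, and
# an optional subdivision. Spanning references and lists are left alone.
MN_CITATION = re.compile(
    r"Minnesota Statutes(?: \d{4})?(?: Supplement)?, "
    r"section \d+[A-Z]?(?:\.\d+)?"
    r"(?:, subdivision \d+)?"
)

text = (
    "Minnesota Statutes 2018, section 31A.10, is amended to read ... "
    "Minnesota Statutes 2019 Supplement, section 223.16, subdivision 4, ..."
)
print([m.group(0) for m in MN_CITATION.finditer(text)])
# → ['Minnesota Statutes 2018, section 31A.10',
#    'Minnesota Statutes 2019 Supplement, section 223.16, subdivision 4']
```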

There will be big variability in the quality of the data, but that's the case with lots of other things in the data so I don't think it's a deal-breaker here. Frankly just having the chapter laws with clickable links in the jurisdictions where that's available is a great case of 80/20, since in my experience what bills actually do once passed is a common point of confusion.

FWIW, along the same lines as what you posted above, I looked into saving references as free-form objects, e.g. {'chapter': '12', 'section': '23A', 'subsection': '3', 'paragraph': '1'}, but in my research different consumers seem to have differing opinions about what the most granular bits are, and even the naming of levels is confusing. What's at depth 2 might be a 'section' in one jurisdiction and a 'title' in another. When you hit registers and register equivalents it gets goofy as well. It just didn't seem worth writing (N) parsers for this in the scrapers, as opposed to letting consumers choose their own model.

RE: specificity/immediacy -- I think having a list broken down by reference, with optional expires/effective dates, solves that concern. If it turns out that only having data for $jurisdiction at the title level instead of the subsection level introduces cases where we're providing data so inadequate it's misleading, we can improve the parser or just remove it.
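For concreteness, one stored citation entry under that shape might look something like the following. Every field name here (and the placeholder URL) is illustrative, not a finalized schema.

```python
# One citation entry as discussed in this proposal: the raw reference text,
# a link where the jurisdiction provides one, and optional effective/expires
# dates. All field names are hypothetical, not the final schema.
citation = {
    "citation": "Minnesota Statutes 2018, section 31A.10",
    "citation_type": "statute",   # vs. "chapter_law", "constitution", ...
    "url": "https://example.gov/statutes/31A.10",  # placeholder URL
    "effective": "2021-08-01",    # optional: when the change takes effect
    "expires": None,              # optional: sunset date, if any
}
```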

This is also a bit of a punt because I don't think it's worth our effort to wade through all the complexity and try to produce consistent standards for such a small feature of the project, particularly if we're supporting DC, and potentially the feds in the future. Codes aren't terrible but chapter laws, constitutions and administrative registers are a real minefield.

In general I think that an open-ended "list of citations linked to the source in a box in the right rail of the page" is probably ideal for both Open States and many of its consumers, but the provided metadata gives other consumers the raw material to parse that into links to their own internal law/code databases, or Justia or Lexis or whatever.

jamesturk commented 3 years ago

Thanks for this proposal, I think this approach makes sense for the reasons articulated.

100% agree that trying to agree on a standard for components, etc. is a white whale; former OSer Thom Neale chased it for a while, and while I never gained the understanding of it that he had, my takeaway is that this is an essential intermediate step: if we are ever going to move this to structured data, we'd at least need these raw references first. (FWIW, I've also come to believe that pretty much any additional work on normalization probably belongs in a post-processing step.)

I do think we should probably add a bit of this discussion to the document itself for posterity. What do you think about adding a section on limitations, etc. that at least mentions the guideline of not exploding the bill scrape with parsing of the bill text as a general rule, plus as much of the above discussion as you think warranted?

jamesturk commented 3 years ago

Realized that we should accept this as a draft. I figure the criteria for accepting a draft are that the proposal is complete and sound. This way drafts will be in the main history, even if rejected later.
(The process is new, so if in the future we'd rather just have one PR to get things to final we can figure that out.)

showerst commented 3 years ago

@jamesturk cool.

I'll get the relevant conversation merged into the proposal, then start on the core PR later in the week. I may need some help on things on the openstates schema/migration side; I'll ping you when I get there.

jessemortenson commented 3 years ago

Good conversation - I feel comfortable with where this is going.