Closed lmullen closed 1 year ago
CAP knows a lot of parallel cites. It does not know any parallels for N.J. Eq.
White list should have standardized entries for Stewart's Equity/Chancery (`Stew.`), with `reporter_in_cap` set to `NULL`.
Solving N.J. Eq. cites should be the paradigm case for solving all nominate reporter problems.
This is a complicated process. We want to think through the steps of the process and decide when we are done with them. Note that we are explicitly leaving out statutes and British citations, which can be dealt with at a subsequent stage.
- [x] Corrections to OCR. This has essentially been superseded by subsequent steps. But since it works and could be useful in the future, we are not going to pull it out. New OCR corrections should be added to `legalist.ocr_corrections`.
- [x] Run the detection over MOML. Working, but some minor tweaks on the way recorded in other issues.
- [x] White list and correct the reporters. This process is well understood, but there are some additional refinements.
  - [x] Add a `junk` field so we can more readily keep track of which junk reporters we are ignoring.
  - [x] Add additional reporters. A prioritized list of reporters to be checked can be found in `output.top_reporters_not_whitelisted`, and new corrections can be added to `legalhist.reporters_citation_to_cap`.
- [x] Distill all citations (e.g., citations from a treatise page to messy citation) down to cleaner citations to a specific reporter. This is the actual reporter (which will be useful for analytical purposes too) NOT the CAP reporter just yet. (It is essential that this be able to be joined back to the table of citations.)
- [x] Correct the cleaner citation where there is a nominate reporter with different volume numbering. The connection is available in `legalist.reporters_alt_diffvols_reporters`.
- [x] Link citations to CAP which can be easily joined. This includes citations where the reporter matches CAP's reporter, and reporters where the volume and page numbers are the same. This link happens via the `cap.citations` table, which provides the CAP case ID. This is usually going to be done as a generated table, though a view would maybe be more updatable. This must exclude citations where the reporter is different.
- [ ] Finally, a list of citations which could not be linked automatically where a human figures out the case ID. Ideally this would not be in the same table as above, preserving it from the need to regenerate one of the tables above. (Perhaps, some kind of `COALESCE` clause on a join so this table gets priority.)
- [x] Profit.
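
To make the last two linking steps concrete, here is a rough sketch of how the automatic link and the hand-curated link might be combined with `COALESCE`. It is only an illustration using an in-memory SQLite database, not the project's actual schema: the table names (`whitelist`, `cap_citations`, `citations`, `manual_links`), the column `reporter_standard`, and all the rows are invented for the example. Only `reporter_in_cap`, the idea of joining through a CAP citations table, and the `COALESCE` priority come from the notes above.

```python
import sqlite3

# Hypothetical, simplified schema. The real tables likely have more columns.
SCHEMA = """
CREATE TABLE whitelist (reporter_standard TEXT PRIMARY KEY, reporter_in_cap TEXT);
CREATE TABLE cap_citations (reporter TEXT, volume INTEGER, page INTEGER, case_id INTEGER);
CREATE TABLE citations (id INTEGER PRIMARY KEY, reporter TEXT, volume INTEGER, page INTEGER);
CREATE TABLE manual_links (citation_id INTEGER PRIMARY KEY, case_id INTEGER);
"""

# The automatic link joins on CAP's reporter name plus volume and page.
# When reporter_in_cap is NULL the join condition is never true, so those
# citations are excluded from automatic linking. COALESCE then gives the
# human-curated table priority over the automatic match.
LINK_QUERY = """
SELECT c.id, COALESCE(m.case_id, a.case_id) AS case_id
FROM citations AS c
JOIN whitelist AS w ON w.reporter_standard = c.reporter
LEFT JOIN cap_citations AS a
       ON a.reporter = w.reporter_in_cap
      AND a.volume = c.volume
      AND a.page = c.page
LEFT JOIN manual_links AS m ON m.citation_id = c.id
ORDER BY c.id
"""


def build_demo_db():
    """Create an in-memory database filled with made-up example rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO whitelist VALUES (?, ?)",
        [("Mass.", "Mass."), ("Stew.", None)],  # Stew. has no CAP reporter
    )
    conn.executemany(
        "INSERT INTO cap_citations VALUES (?, ?, ?, ?)",
        [("Mass.", 10, 100, 1001), ("N.J. Eq.", 28, 50, 2002)],
    )
    conn.executemany(
        "INSERT INTO citations VALUES (?, ?, ?, ?)",
        [(1, "Mass.", 10, 100), (2, "Stew.", 1, 50),
         (3, "Stew.", 2, 75), (4, "Mass.", 10, 100)],
    )
    conn.executemany(
        "INSERT INTO manual_links VALUES (?, ?)",
        [(2, 2002), (4, 9999)],  # case IDs a human looked up by hand
    )
    return conn


def link_citations(conn):
    """Return {citation_id: case_id}; case_id is None when nothing links."""
    return dict(conn.execute(LINK_QUERY).fetchall())


if __name__ == "__main__":
    print(link_citations(build_demo_db()))
    # {1: 1001, 2: 2002, 3: None, 4: 9999}
```

In this toy data, citation 1 links automatically, citation 2 links only through the manual table, citation 3 remains unlinked, and citation 4 shows the manual entry overriding the automatic match. Keeping `manual_links` as its own table means the automatic table can be regenerated without touching the hand-entered rows.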