lmullen / legal-modernism

Law and legal practice modernized in the nineteenth-century United States. We are studying and visualizing the history of the modernization of American law.
https://legalmodernism.org
MIT License

Check whether the cite detector is finding good citations #67

Closed lmullen closed 1 year ago

lmullen commented 2 years ago

@kfunk074 asked for the citations we've found so far for these treatises.

lmullen commented 2 years ago

The citations detected are attached. It's best to think about the information here as having two parts.

  1. The citation detected (boils down to the raw field). The citations detected are not entirely final, but reasonably close.
  2. The link between the citation and the case (the case field contains the CAP case ID). This was done quick and dirty, without any guessing for page numbers and so forth. So treat this as temporary data.

If there is a reporter_abbr field that looks like a reporter (i.e. not obviously junk) and there is not an entry in reporter_cap, then we want to add that to the whitelist.

If there is an entry in the reporter_cap field (and thus in the cleaner_cite field) but not in the case field, it would be helpful to know the categories of problems that are causing that. I.e., is the volume number wrong? The page number? Is the page number wrong in the actual text? Or is it an OCR error? Stuff like that.
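
The two checks above can be sketched as a small triage script. This is only an illustration, assuming the attached CSVs have columns named reporter_abbr, reporter_cap, and case as described; the function names, bucket labels, and sample rows (including the case ID) are invented for the example and are not part of the project.

```python
import csv
import io
from collections import Counter


def triage(row):
    """Sort one detected citation into a review bucket (labels are hypothetical)."""
    abbr = (row.get("reporter_abbr") or "").strip()
    cap = (row.get("reporter_cap") or "").strip()
    case = (row.get("case") or "").strip()
    if case:
        return "linked"                    # matched to a CAP case ID
    if cap:
        return "whitelisted_but_unlinked"  # reporter known, case not found
    if abbr:
        return "whitelist_candidate"       # looks like a reporter, not whitelisted
    return "no_reporter"


def summarize(csv_file):
    """Count how many rows of an open CSV file fall into each bucket."""
    return Counter(triage(row) for row in csv.DictReader(csv_file))


# Invented rows for illustration only; real files would be opened from disk.
SAMPLE = """raw,reporter_abbr,reporter_cap,cleaner_cite,case
1 How. Sp. T. Rep. 114,How. Sp. T. Rep.,,,
3 Wend. 10,Wend.,Wend.,3 Wend. 10,
5 Johns. 20,Johns.,Johns.,5 Johns. 20,12345
"""

counts = summarize(io.StringIO(SAMPLE))
print(counts)
```

Rows in the whitelist_candidate bucket still need a human to decide whether the abbreviation is a real reporter or junk, and rows in whitelisted_but_unlinked are the ones to categorize by cause (volume, page, OCR).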

F0102826781.csv F0103272617.csv F0103761500.csv F0105827925.csv F0151947538.csv

kfunk074 commented 2 years ago

On the file ending in 925, the detector returned 5 clean cites out of 2,500 hits on the regular expression. Is there a clean-up process that didn't run? On a quick skim, most of the cites have an extraneous comma.

Similarly, the file ending in 781 returned 54 whitelisted cites, but out of a much smaller denominator (500). The other treatises returned whitelisted cites in the thousands.

lmullen commented 2 years ago

That is a bit strange. Part of it can be explained by the fact that there are reporters without whitelists. But not all, and not that much. I did something janky to write the query quickly. Let me try again a different way.

lmullen commented 2 years ago

Ha. Stupid bug.

lmullen commented 2 years ago

F0102826781.csv F0103272617.csv F0103761500.csv F0105827925.csv F0151947538.csv

Those are revised versions. I think you will find them more to your liking. For uninteresting reasons, I was matching the reporter only when the reporter we found exactly matched the one in CAP, which is obviously not what we wanted to do. The mystery is not why the others were so bad but why the first one worked at all. I also improved some other things.

lmullen commented 2 years ago

Here are twenty sample "citations" for each of the reporters to be whitelisted. The reporters are ordered by frequency, so best to start at the top and work down. Each "citation" has a link to a treatise in MOML, along with the page number of the citation. To make this file size manageable, it is only "reporters" that appear at least 1000 times, not 100 times as in the treatise-by-treatise files above.

Examples of citations to reporterts not whitelisted.csv

lmullen commented 2 years ago

The following files supersede all those above. They have been updated to include a better list of what has already been whitelisted.

These are the top reporters still to be whitelisted, with sample citations for them.
Top reporters not yet whitelisted.csv Sample citations for top reporters not yet whitelisted.csv

And these are the citations found for specific treatises, with the quick and temporary way of linking them to CAP cases.
F0102826781.csv F0103272617.csv F0103761500.csv F0105827925.csv F0151947538.csv

kfunk074 commented 2 years ago

RA has checked F0103272617. Out of 166 citations on the page, we had 8 errors and 7 missing citations, for an overall accuracy of 91% (88% if we ignore statutory cites). Not bad. Only one of the errors was a number error. All the others were text OCR problems that will be resolved on our next round of whitelisting.
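
One way to reproduce the 91% figure, assuming both the errors and the missed citations are counted against the 166 citations on the page (the comment does not spell out the formula, so this is a guess at the arithmetic):

```python
def accuracy(total, errors, missed):
    """Share of citations that were detected and correct."""
    return (total - errors - missed) / total


rate = accuracy(total=166, errors=8, missed=7)
print(f"{rate:.0%}")  # 91%
```

The 88% figure depends on how many of the citations were statutory, which is not stated here, so it is not reproduced.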

The only systemic issue is that we consistently missed citations to Howell's Special Term Reports, which look like this: 1 How. Sp. T. Rep. 114. I checked the OCR on the MOML pages and it's flawless. We did pick up citations to the reporter when there was an OCR error, for instance How. Sp. T. Rep (no terminal period) and How. Sp.T. Rep. (missing internal space) are both in the detected citations list. Is there an alpha character limit or something that's excluding the full cites?

kfunk074 commented 2 years ago

Slight amendment: 3 cites were missed because the volume number 1 was OCR'd as I or ]. Not worth addressing right now, but we can keep in mind that these kinds of errors happen.

lmullen commented 2 years ago

As you surmised, that abbreviation is 16 characters long, but the regex limited the abbreviations to 15 characters. I have made a change and added a test in https://github.com/lmullen/legal-modernism/commit/914bbeb2766f43066632460ebb4e0ae178498c7e
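
To illustrate the failure mode, here is a simplified sketch in Python. It is not the project's actual regex (see the linked commit for the real change); the pattern and the cite_pattern helper are made up to show how a length cap on the abbreviation drops the clean citation while letting the two OCR-damaged variants through.

```python
import re


def cite_pattern(max_abbr_len):
    """Build a toy 'volume abbreviation page' pattern with a capped abbreviation."""
    return re.compile(
        r"\b(?P<volume>\d{1,3})\s"
        # One capital letter plus up to max_abbr_len - 1 further characters.
        r"(?P<abbr>[A-Z][A-Za-z.' ]{0,%d})\s" % (max_abbr_len - 1)
        + r"(?P<page>\d{1,4})\b"
    )


old = cite_pattern(15)  # the limit that caused the misses
new = cite_pattern(16)  # one character longer

clean = "1 How. Sp. T. Rep. 114"      # abbreviation is 16 characters
no_period = "1 How. Sp. T. Rep 114"   # OCR dropped the final period: 15
no_space = "1 How. Sp.T. Rep. 114"    # OCR dropped an internal space: 15

for text in (clean, no_period, no_space):
    print(text, bool(old.search(text)), bool(new.search(text)))
```

With the 15-character cap only the two damaged variants match, which is the pattern seen in the detected citations list.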

kfunk074 commented 2 years ago

RA has spot checked the other four treatises. Overall, we have an initial accuracy rate of 75%. That will improve somewhat with the next round of whitelisting. A couple of the treatises just have really bad OCR on the numbers and there's nothing our current approach can do about it. Column breaks in footnotes threw a lot of things off, too. The good news is that the treatise with the most column break errors also used parallel cites abundantly, so if we're just trying to draw one edge between a treatise and a cited case, we probably have an average of four different chances to get it right between the parallel cites and the table of authorities.

Only systemic issue that arose is we've identified some more single-volume reporters. My recollection is you're keeping a separate library of those. If so, these should be added: Baldw., Comst., Cro. Car., Hob., Palm., Peake's Cas. All except the first appear to be UK reporters.

kfunk074 commented 1 year ago

reporters_citation_to_cap_whitelist-4.csv

Update to the whitelist, covering a little over 1,000 entries and 5 million "cites." Let me know if there's a better format, but this is the format used for the previous whitelisting efforts.

lmullen commented 1 year ago

@kfunk074 I've added this to the database.

I'm not sure why, but about 50 of the entries were duplicated. I cleaned it up, and it's unlikely to happen again, so I wouldn't worry about it.

kfunk074 commented 1 year ago

Another 1,000 entries to update the whitelist, hopefully with few duplicates, but there are probably some. This should cover all reporters cited more than 5,000 times (there may have been a troubling entry or two we left for further investigation).

reporters_citation_to_cap_whitelist-5.csv

kfunk074 commented 1 year ago

At this point, to avoid duplicates, we should probably start with a fresh list of not-yet-whitelisted reporters cited more than 500 times. And the example entries have been really helpful, so we should provide an updated sheet of 10 random examples for each reporter for Sean as well (can be limited to reporters with more than 750 or 1000 cites if 500 is too large).

kfunk074 commented 1 year ago

Also a quick update on the spot checking: The treatise mentioned above that had a lot of column errors but also a lot of parallel cites--of the 80 errors, the underlying cases were cited and accurately identified elsewhere in the treatise 40 times. If the aim is just to draw one edge between the treatise and the case, our current method achieved an overall accuracy of 86% on the treatise that had the most errors. Not terrible considering the rate will improve a little more with whitelisting.

We do need to keep this in mind when citations do not align with CAP, though. Column break errors are different from OCR errors and may lead us into correcting ourselves into false citations.
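
The "one edge per treatise and case" idea above can be sketched as follows. This is a hypothetical illustration, assuming each spot-checked citation is recorded as a pair of the case the RA determined was actually cited and whether our detector linked it correctly; the function name and the sample data are invented.

```python
def unrecovered_cases(checked):
    """Return cases that were never linked correctly anywhere in the treatise.

    checked: iterable of (true_case_id, linked_correctly) pairs.
    """
    checked = list(checked)
    found = {case for case, ok in checked if ok}
    cited = {case for case, _ in checked}
    return cited - found


# Invented data: case "A" is botched once but picked up by a parallel cite.
sample = [("A", False), ("A", True), ("B", False), ("C", True)]

citation_errors = sum(1 for _, ok in sample if not ok)
missing_edges = unrecovered_cases(sample)
print(citation_errors, missing_edges)
```

In this toy data two citations are wrong but only one edge is actually lost, which is the same effect described for the treatise with the column break errors.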

That's the end of our spot-checking for those five treatises. Maybe after the next whitelist update you could give us five more random treatises to check?

kfunk074 commented 1 year ago

To collate the above into a task list: