Reconcile English/British/foreign citations

kfunk074 commented 3 years ago

CAP has only U.S. cases and does not detect citations of English (or any foreign) reporters, nor would it help much if it did, as the case text and metadata will not be in CAP.

[ ] Prepare an Excel list of foreign reporters and their common abbreviations
[ ] Write specific regex detectors for foreign reporters
[x] Explore open source corpora of English reports spanning the relevant time period (1800-1920)
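
As a rough sketch of the second task, a detector could be generated from the abbreviation list rather than written by hand. The abbreviations below are a small illustrative sample, not the project's list, and `build_detector` is a hypothetical helper. Periods and spacing are treated as optional because printed citations vary.

```python
import re

# Illustrative sample only; the real list would come from the spreadsheet.
REPORTERS = ["Ves. Jun.", "Ves.", "Atk.", "Eng. Rep."]


def build_detector(abbreviations):
    """Compile one regex matching '<volume> <reporter> <page>' for known reporters."""
    alternatives = []
    # Longest abbreviations first so "Ves. Jun." is tried before "Ves."
    for abbr in sorted(abbreviations, key=len, reverse=True):
        # Make the period after each token optional, e.g. "Eng Rep" or "Eng. Rep."
        tokens = [re.escape(tok.rstrip(".")) + r"\.?" for tok in abbr.split()]
        alternatives.append(r"\s*".join(tokens))
    return re.compile(
        r"\b(?P<volume>\d{1,3})\s+"
        r"(?P<reporter>" + "|".join(alternatives) + r")"
        r"\s*(?P<page>\d{1,4})\b"
    )


detector = build_detector(REPORTERS)
text = "See 2 Ves. Jun. 345 and 86 Eng Rep 2."
found = [(m["volume"], m["reporter"], m["page"]) for m in detector.finditer(text)]
print(found)
```

Generating the pattern from the list means a new reporter only requires a new row in the spreadsheet, not a new regex.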

kfunk074 commented 3 years ago

Grajzl and Murrell's helpful guide to the English Reports, and how they constructed their database of pre-1765 case reports: http://www.econweb.umd.edu/~murrell/articles/AppendicesMachineCaselawJOIE.pdf

"The source of our data and the starting point for our corpus construction and processing was a digitized database of English Reports, obtained from Juta and Company (Pty) Ltd (English Reports (1260-1865), n.d.). The resultant database consists of 129,042 nominate reports of decisions rendered in the English courts of law between the early 13th century and the mid-19th century."

kfunk074 commented 3 years ago

English Case Reports.xlsx

Here's a start on a database of UK law reports, adapted from the second English edition (1892) of Joseph Story's Commentaries on Equity. It's probably incomplete, but hopefully not very.

kfunk074 commented 3 years ago

So with perfect OCR, we can at least use this dataset to match a citation to a UK reporter. To this point I haven't attempted to correct common OCR errors on English reports. To do that, it would be helpful to have the output of our general regex run on the Story volume I mention above.

[ ] Produce a general regex citations output for Story's English Equity, Gale ID: F0105632267

kfunk074 commented 3 years ago

86 Eng Rep 2

Image of a typical page in the English Reports. The plain text is not expensive to acquire. This page makes clear that the English Reports pose two complications we won't usually encounter with American reports: 1) Multiple cases can be reported on a single page, meaning citation "addresses" are not unique. 2) Many private reporters had such limited runs that they produced only one volume, so the standard citation form has no volume number. Neither of these derails the main project. We will either miss citations to the obscure private reporters or write dedicated regexes to find them.
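
A minimal sketch of how both complications might be handled. The case records are made up, the two single-volume abbreviations are assumed examples that would need checking against the reporter list, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Made-up records standing in for English Reports entries: (volume, page, case name).
CASES = [(86, 2, "First case"), (86, 2, "Second case"), (86, 3, "Third case")]


def index_by_address(cases):
    """Map each (volume, page) address to every case reported there."""
    index = defaultdict(list)
    for volume, page, name in cases:
        index[(volume, page)].append(name)
    return index


# Assumed examples of single-volume reporters cited without a volume number.
SINGLE_VOLUME = ["Hob.", "Yelv."]
SINGLE_VOLUME_CITE = re.compile(
    r"\b(?P<reporter>" + "|".join(re.escape(abbr) for abbr in SINGLE_VOLUME) + r")"
    r"\s*(?P<page>\d{1,4})\b"
)


def find_single_volume_cites(text):
    """Return (reporter, page) pairs for citations that carry no volume number."""
    return [(m["reporter"], int(m["page"])) for m in SINGLE_VOLUME_CITE.finditer(text)]


addresses = index_by_address(CASES)
print(addresses[(86, 2)])  # one address, more than one case
print(find_single_volume_cites("as held in Hob. 12 and Yelv. 45"))
```

Because an address can resolve to several cases, anything that links a citation to the English Reports should expect a list of candidates, not a single record.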

So far as I can tell, there is no CAP equivalent for UK case reports. There are things we could do to create more meaningful connections in the data, but these should all be considered back burner to the main project.

We could "section" the cases into separate texts as we did with the Field Codes. Each text could retain its "address" in the English Reports and we could try to extract the OCR of the private reporter citations with which each report begins.
The English Reports volumes are grouped by jurisdiction (King's Bench, Chancery, Exchequer, etc.) and run chronologically within each. An RA could prepare a database of court personnel and corresponding dates. We could then track decisional law by court and jurist as we can with CAP.
Grajzl and Murrell are trying to topic model this corpus to death. I'll get in touch to see what if anything they've done to think about citations.
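
To make the court-personnel idea above concrete, here is a rough sketch of what such a table and lookup might look like. The rows are placeholders (no real judges or dates), and `Tenure` and `judges_sitting` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenure:
    court: str
    judge: str
    start: int  # first year on the bench
    end: int    # last year on the bench (inclusive)


# Placeholder rows; a real table would be compiled by hand.
TENURES = [
    Tenure("King's Bench", "Judge A", 1750, 1770),
    Tenure("King's Bench", "Judge B", 1770, 1790),
    Tenure("Chancery", "Judge C", 1760, 1780),
]


def judges_sitting(court, year, tenures=TENURES):
    """Return the judges whose tenure on `court` covers `year`."""
    return [t.judge for t in tenures if t.court == court and t.start <= year <= t.end]


print(judges_sitting("King's Bench", 1770))
```

Combined with the court for each volume and the year of each decision, a lookup like this would give the candidate jurists for any case.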

lmullen commented 2 years ago

@kfunk074 Two questions about the status of this one.

Any more (much more?) to be done to create as complete a list of English reporters as reasonable?
Any reason to think these won't be picked up by our general Go cite detector? In other words, the problem isn't detection but analysis?

kfunk074 commented 2 years ago

I don’t know what I don’t know. I think it’s a pretty extensive list, and I don’t know where to look to find more, though there may well be more out there. Many are single-volume, but that’s the only hang-up in finding them with a general regex search.

kfunk074 commented 2 years ago

For future reference, this database might be helpful as a UK CAP alternative. Have yet to suss out how comprehensive it is: https://swarb.co.uk/its-what-we-do/

lmullen commented 2 years ago

We have essentially detected the British citations, unless there are some reporters that fall outside the 1 Reporter 123 pattern. What we need is a process to reconcile them to useful information parallel to CAP.
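
For reference, the "1 Reporter 123" shape can be approximated with a pattern like the one below. This is an illustrative Python sketch, not the project's Go detector, and the character classes are assumptions about what a reporter abbreviation looks like.

```python
import re

# volume, then one to four capitalized tokens (the reporter), then page
GENERAL_CITE = re.compile(
    r"\b(?P<volume>\d{1,3})\s+"
    r"(?P<reporter>[A-Z][A-Za-z.&]*(?:\s+[A-Z&][A-Za-z.&]*){0,3})\s+"
    r"(?P<page>\d{1,4})\b"
)


def find_general_cites(text):
    """Return (volume, reporter, page) for every match of the general pattern."""
    return [
        (int(m["volume"]), m["reporter"], int(m["page"]))
        for m in GENERAL_CITE.finditer(text)
    ]


sample = "See 3 Atk. 100; also 86 Eng. Rep. 2 and Hob. 12."
print(find_general_cites(sample))
```

In this sample the two volume-bearing cites are found, while the volume-less "Hob. 12" is skipped, which is the single-volume gap noted earlier in the thread.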

kfunk074 commented 1 year ago

Not sure how I missed this before. A complete database of the English Reports appears to be here: http://www.commonlii.org/uk/cases/EngR/

It appears there are hand-keyed parallel citations that could link to our detected cases and allow us to extract at least the dates of the decisions. I'll see if they can share their datafiles.

kfunk074 commented 1 year ago

Behold, the English Reports. Turns out each case has one and only one parallel cite, so no extra table needed for that. The second table here matches up volume number to court jurisdiction. We have the full text too, just not in table form yet. Low priority to get full text I would think.
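
A minimal sketch of how a detected cite might be reconciled against the case table, assuming columns named `case_id`, `official_cite`, `nominate_cite`, and `year` (the real headers may differ). The rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up rows; the column names are assumptions about the csv layout.
SAMPLE = """case_id,official_cite,nominate_cite,year
1,86 Eng. Rep. 2,1 Foo 3,1700
2,86 Eng. Rep. 2,1 Foo 4,1700
3,90 Eng. Rep. 10,2 Bar 50,1710
"""


def normalize(cite):
    """Lowercase and drop periods and extra spaces so minor variants compare equal."""
    return " ".join(cite.replace(".", " ").lower().split())


def build_lookup(rows):
    """Index every row under both its official and its nominate citation."""
    lookup = defaultdict(list)
    for row in rows:
        for column in ("official_cite", "nominate_cite"):
            lookup[normalize(row[column])].append(row)
    return lookup


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
lookup = build_lookup(rows)

# A detected cite may arrive in either form and with different punctuation.
matches = lookup.get(normalize("86 Eng Rep 2"), [])
print([(row["case_id"], row["year"]) for row in matches])
```

The lookup returns a list because several cases can share one official page address. A nominate cite should normally narrow that to a single case.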

Edit: File too big. Download the csv here.

english_reports_courts_by_volume.csv

kfunk074 commented 1 year ago

A few pointers, as I review Phil's data:

The reporter_standard entries in the whitelist now match exactly the reporter abbreviations used in the English Reports. A "raw" MOML citation should match exactly the official or nominate citation from the English Reports.
The English Reports give one and only one nominate citation for each official citation. I don't know if that's historically accurate, but I have no evidence to doubt it either, so for now we can just embrace the simplicity.
The English Reports are comprehensive through 1866 and sporadic until 1877. An entirely different set of reports, the Law Times Weekly, became the official reporter in the 1870s. I'm working with law librarians to see if a structured database of the Law Times is available, but just to be clear: the English Reports cover UK cases from 1200 to about 1870. They will only account for a fraction (half? a third?) of all cites our whitelist labels "UK." But they're comprehensive, influential, and the metadata is useful, so well worth plugging in now while we wait to see if anything comes of the Law Times.
The good and bad news is that the metadata is far less extensive than CAP's, and the corpus far smaller. Hopefully that helps with linking. There are three tables in the drive folder linked above: the data on each case in the reports, a table of jurisdictions by volume (the printed English Reports are organized chronologically by jurisdiction), and a table of full-text reports keyed to each case id (being ironed out by Phil as of 7/23 but nearly complete). We don't need to import the full text if we don't want to burden the server with a bunch of data we're not going to use for the foreseeable future.
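
As a sketch of how the first two tables fit together, the court for a case can be read off the volume number in its official citation. The volume numbers, court labels, and field names below are placeholders, not the contents of the actual files.

```python
import re

# Placeholder mapping standing in for the jurisdictions-by-volume table.
COURT_BY_VOLUME = {1: "Court A", 2: "Court A", 3: "Court B"}

# Placeholder case rows; field names are assumptions.
CASES = [
    {"case_id": 10, "official_cite": "1 Eng. Rep. 5"},
    {"case_id": 11, "official_cite": "3 Eng. Rep. 40"},
    {"case_id": 12, "official_cite": "9 Eng. Rep. 7"},
]


def volume_of(cite):
    """Extract the leading volume number from an 'N Eng. Rep. P' citation."""
    match = re.match(r"\s*(\d+)\s+Eng\.?\s*Rep\.?", cite)
    return int(match.group(1)) if match else None


def attach_court(cases, court_by_volume):
    """Return copies of the case rows with a 'court' field added (None if unknown)."""
    return [
        {**case, "court": court_by_volume.get(volume_of(case["official_cite"]))}
        for case in cases
    ]


for row in attach_court(CASES, COURT_BY_VOLUME):
    print(row["case_id"], row["court"])
```

Volumes missing from the mapping come back as `None`, which makes gaps in the jurisdiction table easy to spot before import.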

kfunk074 commented 1 year ago

The complete, clean, final, and godly English Reports are here: https://drive.google.com/drive/folders/1QpwUQHIxzAJdeUG15CdNPioT5HBilyKY?usp=sharing

The csv file contains everything described above as well as the clean years, titles, and word counts from Peter Murrell's data. It's ready to integrate whenever you're ready to tackle that.

lmullen / legal-modernism

Reconcile English/British/foreign citations #42