Closed lmullen closed 2 years ago
Here is a start on the common errors found in the Pomeroy and Dillon treatises.
We are aiming to replace individual words rather than whole units wherever the replacement is unambiguous. So correcting Tcx to Tex and Cmn to Crim independently will also correct "Tcx. Cmn. App." to "Tex. Crim. App." In some cases we must match the whole unit, so N. II. becomes N. H.
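To make the word-versus-unit distinction concrete, here is a rough sketch in Python. The two dictionaries hold only the examples from this comment, and the names `UNIT_FIXES`, `WORD_FIXES`, and `correct_reporter` are made up for illustration; this is not the code in the repo.

```python
import re

# Illustrative entries only, taken from the examples in this comment.
UNIT_FIXES = {"N. II.": "N. H."}            # must match the whole unit
WORD_FIXES = {"Tcx": "Tex", "Cmn": "Crim"}  # safe to fix word by word


def correct_reporter(reporter: str) -> str:
    """Apply whole-unit fixes first, then word-level fixes."""
    for wrong, right in UNIT_FIXES.items():
        reporter = reporter.replace(wrong, right)
    for wrong, right in WORD_FIXES.items():
        # \b keeps "Tcx" from matching inside a longer word.
        reporter = re.sub(rf"\b{re.escape(wrong)}\b", right, reporter)
    return reporter


print(correct_reporter("Tcx. Cmn. App."))
print(correct_reporter("N. II."))
```

Because each word is fixed independently, two short entries cover the three-word reporter without a dedicated row for "Tcx. Cmn. App."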
These corrections are only unambiguous within the context of a citation. That is, it would be best if the find and replace were run only within the reporter fields found by our general regex citation detector.
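Something like the following shows what restricting the replacement to the reporter field could look like. The `CITATION` pattern is a deliberately simplified stand-in (volume, reporter, page), not our actual detector, and `FIXES`, `fix_reporter`, and `correct_citations` are hypothetical names.

```python
import re

# Simplified stand-in for the general citation regex: volume, reporter, page.
CITATION = re.compile(
    r"(?P<volume>\d+)\s+(?P<reporter>[A-Z][A-Za-z. ]{0,30}?)\s+(?P<page>\d+)"
)

# Illustrative entries only.
FIXES = {"Tcx": "Tex", "Cmn": "Crim"}


def fix_reporter(reporter: str) -> str:
    for wrong, right in FIXES.items():
        reporter = re.sub(rf"\b{re.escape(wrong)}\b", right, reporter)
    return reporter


def correct_citations(text: str) -> str:
    """Apply the fixes only inside the reporter group of each match."""
    def repl(m: re.Match) -> str:
        start, end = m.span("reporter")
        s = m.string
        # Keep volume, page, and spacing exactly as they were.
        return s[m.start():start] + fix_reporter(m.group("reporter")) + s[end:m.end()]

    return CITATION.sub(repl, text)


print(correct_citations("Mr. Tcx said so. See 12 Tcx. Cmn. App. 345."))
```

The "Tcx" outside the citation is left alone, which is the point: the same string may be a legitimate word elsewhere in the treatise.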
The OCR errors CSV is in the repo at data/ocr-errors.csv: https://github.com/lmullen/legal-modernism/blob/eyecite/data/ocr-errors.csv
I can update the OCR errors CSV if I can get a list of general regex citations in an English treatise: https://github.com/lmullen/legal-modernism/issues/42#issuecomment-967751321
The code for this works fine now. It's just a question of compiling the list of corrections. That's really a separate issue.
We want to create a table of the most obvious OCR errors. This should be a CSV file with the OCR error as one column and the correction as the other. We are only going to make corrections that can be handled by a straightforward find and replace.
Note that these corrections have to happen in both Python and Go, so it is helpful to have the data separate from the function.
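As a sketch of keeping the data separate from the function, the Python side could read the CSV into a plain mapping, and the Go side could do the same from the same file. The column names `error` and `correction` are an assumption here, not necessarily the headers in the real file, and `load_corrections` is a hypothetical helper. The sample data is inlined so the snippet runs on its own.

```python
import csv
import io


def load_corrections(handle) -> dict:
    """Read an error/correction CSV into a dict.

    Assumes a header row with columns named "error" and "correction".
    """
    reader = csv.DictReader(handle)
    return {row["error"]: row["correction"] for row in reader}


# Inline sample standing in for the real CSV file.
SAMPLE = "error,correction\nTcx,Tex\nCmn,Crim\nN. II.,N. H.\n"

corrections = load_corrections(io.StringIO(SAMPLE))
print(corrections)
```

With a real file, the same function would take an open file handle, e.g. `open(path, newline="", encoding="utf-8")`.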