Correct obvious OCR errors in the pre-processing stage (Go)

lmullen / legal-modernism

Law and legal practice modernized in the nineteenth-century United States. We are studying and visualizing the history of the modernization of American law.

https://legalmodernism.org

MIT License


Correct obvious OCR errors in the pre-processing stage (Go) #36

Closed lmullen closed 2 years ago

lmullen commented 3 years ago

We want to create a table of the most obvious OCR errors. This should be a CSV file with the OCR error in one column and the correction in the other. We are only going to make corrections that are a straightforward find and replace.

Note that these corrections have to happen in both Python and Go, so it is helpful to have the data separate from the function.

kfunk074 commented 3 years ago

Common OCR Errors.xlsx

Here is a start on the common errors found in Pomeroy and Dillon treatises.

kfunk074 commented 3 years ago

We are aiming to replace individual words rather than whole units wherever the correction is unambiguous. So correcting Tcx to Tex and Cmn to Crim independently will also correct "Tcx. Cmn. App." to "Tex. Crim. App." In some cases we must match the whole unit, so N. II. becomes N. H.

These corrections are only unambiguous within the context of a citation. That is, it would be optimal if the find and replace were run only within the reporter fields found by our general regex citation detector.

lmullen commented 3 years ago

The OCR errors CSV is in the repo at data/ocr-errors.csv. https://github.com/lmullen/legal-modernism/blob/eyecite/data/ocr-errors.csv

kfunk074 commented 3 years ago

I can update the OCR errors CSV if I can get a list of general regex citations in an English treatise: https://github.com/lmullen/legal-modernism/issues/42#issuecomment-967751321

lmullen commented 2 years ago

The code for this works fine now. It's just a question of compiling the list of corrections. That's really a separate issue.