lmullen / legal-modernism

Law and legal practice modernized in the nineteenth-century United States. We are studying and visualizing the history of the modernization of American law.
https://legalmodernism.org
MIT License

Normalize reporter names found from generic reporter regex #46

Closed lmullen closed 2 years ago

lmullen commented 2 years ago

A common problem is that the generic regex for reporters matches reporter names that still need to be cleaned up.

Example tests for checking this cleaning are here: https://github.com/lmullen/legal-modernism/blob/issue44-Refactor-the-code-base-to-allow-for-tests/modularity/go/citations/citation_test.go#L18
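As a rough illustration of the kind of minimal cleanup meant here, the following Go sketch collapses whitespace and trims stray punctuation from a matched reporter name. The function name `normalizeReporter` and the set of trimmed characters are assumptions made for this example, not the project's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeReporter is a hypothetical helper (not taken from the repository).
// It applies only light cleanup to a reporter name captured by a generic regex.
func normalizeReporter(raw string) string {
	// Collapse runs of whitespace, including OCR line breaks, into single spaces.
	s := strings.Join(strings.Fields(raw), " ")
	// Trim spaces and stray punctuation captured at either edge of the match.
	// Periods are kept because they belong to reporter abbreviations.
	// The cutset is an assumption chosen for illustration.
	return strings.Trim(s, " ,;:()[]")
}

func main() {
	fmt.Println(normalizeReporter("  N.  Y.\n Sup. Ct. ,")) // prints: N. Y. Sup. Ct.
}
```

Anything beyond this, such as fixing OCR misreadings of letters, is deliberately left out of the sketch.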

lmullen commented 2 years ago

This should probably be done only minimally in the citation detector itself. It is really a problem for OCR correction on one end or citation reconciliation on the other. At those stages the correction data can be used in both Go and Python, and there is more room for human intervention to fix obvious mistakes that are not easily spelled out in a regex.
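One way correction data could be shared between Go and Python is a plain lookup table in a language-neutral format such as JSON, which a person can also edit by hand. The sketch below assumes that approach. The names `correctionsJSON`, `loadCorrections`, and `reconcile` are hypothetical, and the table entries are made-up placeholders, not project data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// correctionsJSON stands in for a hypothetical shared file that maps a
// reporter name as found to its canonical form. Entries are illustrative only.
const correctionsJSON = `{
  "Mass": "Mass.",
  "Wend": "Wend."
}`

// loadCorrections parses the lookup table from JSON bytes.
func loadCorrections(data []byte) (map[string]string, error) {
	var m map[string]string
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, fmt.Errorf("parsing corrections: %w", err)
	}
	return m, nil
}

// reconcile returns the canonical name and true when a correction exists.
// Otherwise it returns the input unchanged and false, so unknown names can
// be set aside for human review.
func reconcile(found string, corrections map[string]string) (string, bool) {
	canonical, ok := corrections[found]
	if !ok {
		return found, false
	}
	return canonical, true
}

func main() {
	corrections, err := loadCorrections([]byte(correctionsJSON))
	if err != nil {
		panic(err)
	}
	name, known := reconcile("Mass", corrections)
	fmt.Println(name, known)
}
```

A Python consumer could read the same table with its standard `json` module, which keeps the corrections out of the regex itself.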

lmullen commented 2 years ago

This is done now. The remainder will happen at the cleanup stage.