ucam-department-of-psychiatry / crate

Create and use de-identified research databases. Preprocess, extract text, anonymise/de-identify, link, apply natural language processing, query for research, manage consent for contact.
GNU General Public License v3.0

Filter name lists by common English words #109

Closed RudolfCardinal closed 1 year ago

RudolfCardinal commented 1 year ago

Fetch all words from a subset of Project Gutenberg books. Filter them: e.g. each word must start with a lower-case letter and be at least 2 characters long. This should remove some usages as names (although it also removes words at the starts of sentences, of course). Then (in the specimen usage), trim to the top 99% by frequency and intersect with valid Scrabble/crossword words; this gives a list of common English words. Take all US forenames/surnames, subtract eponyms, subtract the new list of common English words, and use the result as our name exclusion file.

The resulting exclusion file still gets rid of names such as John and Veronica, but no longer excludes common English words that also occur as names, such as excellent and moody.
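The procedure above can be sketched roughly as follows. This is an illustration only, not CRATE's actual implementation: the function names (`candidate_words`, `common_words`, `name_exclusion_list`) are hypothetical, and "top 99%" is assumed here to mean the most frequent words that together account for 99% of all counted word occurrences. The corpus, valid-word list, name list and eponym list are toy stand-ins for the real Project Gutenberg, Scrabble/crossword and US name data.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Set

TOKEN_RE = re.compile(r"[A-Za-z]+")


def candidate_words(text: str) -> Iterator[str]:
    """Yield tokens that start with a lower-case letter and have >= 2 letters."""
    for token in TOKEN_RE.findall(text):
        if len(token) >= 2 and token[0].islower():
            yield token.lower()


def common_words(texts: Iterable[str], valid_words: Iterable[str],
                 coverage: float = 0.99) -> Set[str]:
    """Most frequent words covering `coverage` of usage (assumed meaning of
    "top 99%"), intersected with a list of valid dictionary words."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(candidate_words(text))
    total = sum(counts.values())
    kept: Set[str] = set()
    running = 0
    for word, n in counts.most_common():
        # Stop once the words kept so far already reach the coverage target.
        if total and running / total >= coverage:
            break
        kept.add(word)
        running += n
    return kept & {w.lower() for w in valid_words}


def name_exclusion_list(names: Iterable[str], eponyms: Iterable[str],
                        common: Set[str]) -> Set[str]:
    """All names, minus eponyms, minus common English words."""
    return ({n.lower() for n in names}
            - {e.lower() for e in eponyms}
            - common)


# Toy data: "qzx" is rare, so it falls outside the top 99% of usage.
# "John" and "Veronica" are capitalised in the corpus, so they are never
# counted as candidate common words.
corpus = ["excellent " * 120 + "moody " * 79 + "qzx John Veronica"]
valid = ["excellent", "moody", "qzx"]
common = common_words(corpus, valid)

exclusion = name_exclusion_list(
    names=["John", "Veronica", "Excellent", "Moody", "Alzheimer"],
    eponyms=["Alzheimer"],
    common=common,
)
print(sorted(exclusion))  # ['john', 'veronica']
```

In this toy run, John and Veronica stay in the exclusion list, while excellent and moody are dropped from it because they appear as common lower-case words in the corpus.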