Create and use de-identified research databases. Preprocess, extract text, anonymise/de-identify, link, apply natural language processing, query for research, manage consent for contact.
Fetch all words from a subset of Project Gutenberg books.
Filter them: for example, each word must start with a lower-case letter and have at least 2 characters. This should remove some usages as names (although it also removes words at the starts of sentences, of course).
Then (in the specimen usage), trim to the top 99% by frequency and intersect with a list of valid Scrabble/crossword words; this gives a list of common English words.
Take all US forenames/surnames, subtract eponyms, subtract the new list of common English words, and use the result as our name exclusion file.
The resulting file gets rid of names such as John and Veronica, but does not exclude ordinary words that also occur as names, such as excellent and moody.
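The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the function names `common_words` and `name_exclusion_list` are hypothetical, "top 99%" is assumed to mean a cut-off on cumulative word frequency, and all inputs are taken to be in-memory strings and sets rather than downloaded files.

```python
import re
from collections import Counter
from typing import Iterable, Set

WORD_RE = re.compile(r"[A-Za-z]+")


def common_words(text: str, valid_words: Set[str],
                 min_len: int = 2, max_cum_freq: float = 0.99) -> Set[str]:
    """Hypothetical helper: frequent lower-case words that are also valid words."""
    # Keep tokens that start with a lower-case letter and are long enough.
    tokens = [
        w.lower() for w in WORD_RE.findall(text)
        if w[0].islower() and len(w) >= min_len
    ]
    counts = Counter(tokens)
    total = sum(counts.values())
    kept: Set[str] = set()
    cumulative = 0
    # Walk from most to least frequent; stop once the words already kept
    # account for the requested share of all occurrences (an assumed
    # reading of "top 99%").
    for word, n in counts.most_common():
        if cumulative / total >= max_cum_freq:
            break
        kept.add(word)
        cumulative += n
    # Intersect with the list of valid (e.g. Scrabble/crossword) words.
    return kept & valid_words


def name_exclusion_list(names: Iterable[str], eponyms: Iterable[str],
                        common: Set[str]) -> Set[str]:
    """Hypothetical helper: names minus eponyms minus common English words."""
    names_lc = {n.lower() for n in names}
    eponyms_lc = {e.lower() for e in eponyms}
    return names_lc - eponyms_lc - common


if __name__ == "__main__":
    corpus = "The moody cat sat on the excellent mat. John saw the moody cat."
    valid = {"the", "moody", "cat", "sat", "on", "excellent", "mat", "saw"}
    common = common_words(corpus, valid)
    names = ["John", "Veronica", "Moody", "Excellent", "Parkinson"]
    print(sorted(name_exclusion_list(names, ["Parkinson"], common)))
```

With this toy input, John and Veronica stay in the exclusion list, while moody and excellent are removed because they appear as common lower-case words in the corpus. Parkinson is removed as an eponym.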