Create and use de-identified research databases. Preprocess, extract text, anonymise/de-identify, link, apply natural language processing, query for research, manage consent for contact.
GNU General Public License v3.0
The list of names created by crate_fetch_wordlists excludes common names that are uncommon English words #108
The filtered names were produced exactly as in the "specimen usage" at:
https://crateanon.readthedocs.io/en/latest/preprocessing/index.html
which is:
and there's the problem. Not only is "John" a word in this list, but so is "Veronica".
We could use a less thorough dictionary, maybe? Or not subtract English words at all? Our prototypical problems are things like Parkinson (eponym) and Plumber (name but also word).
By not subtracting English words at all, we would sacrifice potentially relevant details in the record, which may be better than letting identifiable data through.
However, here are some entries that the two lists have in common (names that are also words):
excellent, fought, friend, games, he, hope, husband, joyful, kitten, knuckle, libel, limp, lovely, man, memory, mood, music, no, power, powers, sad, stress, true, yes, young, zone
We could say "only low-frequency names"; there's already a cumulative frequency option. With a debugging option, there are thresholds like these:
Forenames are OK: for example, a cumulative frequency maximum of 99% would get rid of most of the names that are useful words:
but surnames are trickier:
It's awkward! There's not really a good substitute for having the names in the structured data. Some of these would be problematic to remove (including for our NLP tools).
Perhaps I've been thinking about frequencies in the wrong place; likely we need "all names" minus "common English words that are not proper nouns". I'll look at that.
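For illustration, the idea of subtracting only common English words from the name list (rather than every dictionary word) could be sketched roughly as below. This is not how crate_fetch_wordlists works; the function names, the frequency table, the threshold and the toy data are all hypothetical, and a plain word-frequency cutoff is only a crude stand-in for "common English words that are not proper nouns".

```python
from typing import Dict, Iterable, Set


def common_words(word_freq: Dict[str, float], min_freq: float) -> Set[str]:
    """Return lower-case words whose relative frequency is at least min_freq."""
    return {w.lower() for w, f in word_freq.items() if f >= min_freq}


def filter_names(names: Iterable[str],
                 word_freq: Dict[str, float],
                 min_freq: float) -> Set[str]:
    """Keep every name except those that are also common English words."""
    exclude = common_words(word_freq, min_freq)
    return {n.lower() for n in names} - exclude


# Toy data, invented for illustration only (not real frequencies).
names = ["John", "Veronica", "Parkinson", "Hope", "Young", "Man"]
word_freq = {
    "john": 1e-6, "veronica": 1e-7, "parkinson": 1e-7,
    "hope": 2e-4, "young": 3e-4, "man": 9e-4,
}

kept = filter_names(names, word_freq, min_freq=1e-4)
print(sorted(kept))  # ['john', 'parkinson', 'veronica']
```

With these toy numbers, "John" and "Veronica" stay in the name list because they are rare as ordinary words, while "hope", "young" and "man" are dropped. An eponym such as "Parkinson" would still be kept as a name under this scheme, so it would not by itself solve that case.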