ucam-department-of-psychiatry / crate

Create and use de-identified research databases. Preprocess, extract text, anonymise/de-identify, link, apply natural language processing, query for research, manage consent for contact.
GNU General Public License v3.0

The list of names created by crate_fetch_wordlists excludes common names that are uncommon English words #108

Closed martinburchell closed 2 years ago

martinburchell commented 2 years ago

The filtered names were produced exactly as in the "specimen usage" at:

https://crateanon.readthedocs.io/en/latest/preprocessing/index.html

which is:

and there's the problem. Not only is "John" a word in this list, but so is "Veronica".

We could use a less thorough dictionary, maybe? Or not subtract English words at all? Our prototypical problems are things like Parkinson (eponym) and Plumber (name but also word).

It may be better to sacrifice potentially relevant details in the record rather than allow identifiable data through, by not subtracting English words at all.

However, here is a list of things in common (names that are words):

excellent, fought, friend, games, he, hope, husband, joyful, kitten, knuckle, libel, limp, lovely, man, memory, mood, music, no, power, powers, sad, stress, true, yes, young, zone
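
An overlap like this can be reproduced with plain set operations. The following is a minimal sketch, assuming both lists fit in memory; the function name `names_that_are_words` and the sample data are hypothetical and are not part of crate_fetch_wordlists.

```python
from typing import Iterable, List


def names_that_are_words(names: Iterable[str], words: Iterable[str]) -> List[str]:
    """Return the names that also appear in the word list (case-insensitive)."""
    word_set = {w.strip().lower() for w in words}
    name_set = {n.strip().lower() for n in names}
    return sorted(name_set & word_set)


# Hypothetical sample data, for illustration only.
names = ["John", "Veronica", "Hope", "Parkinson", "Young"]
words = ["hope", "john", "plumber", "veronica", "young"]

overlap = names_that_are_words(names, words)

# Subtracting the whole word list drops every overlapping name from the
# list of names to scrub, which is the problem described in this issue.
remaining = sorted({n.lower() for n in names} - set(overlap))
```

With this sample data, only "parkinson" would remain on the name list after subtraction, so "John" and "Veronica" would no longer be scrubbed.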

We could say "only low-frequency names"; there's already a cumulative frequency option. With a debugging option, there are thresholds like these:

forenames are OK, e.g. a cumulative frequency maximum of 99% would get rid of most of the names that are useful words:


2022-08-09 14:44:11.117 crate_anon.anonymise.fetch_wordlists:WARNING: 'John' (freq. 1.488242824478893%, cumulative freq. 2.985716961254468%)
2022-08-09 14:44:11.118 crate_anon.anonymise.fetch_wordlists:WARNING: 'Veronica' (freq. 0.06308697109126536%, cumulative freq. 60.66807005639854%)
2022-08-09 14:44:11.119 crate_anon.anonymise.fetch_wordlists:WARNING: 'Hope' (freq. 0.02519781094282285%, cumulative freq. 75.15731115420549%)
2022-08-09 14:44:11.133 crate_anon.anonymise.fetch_wordlists:WARNING: 'Young' (freq. 0.0007929553590484596%, cumulative freq. 94.68736598651715%)
2022-08-09 14:44:11.141 crate_anon.anonymise.fetch_wordlists:WARNING: 'True' (freq. 0.0004983544478353606%, cumulative freq. 95.87775451888258%)
2022-08-09 14:44:11.142 crate_anon.anonymise.fetch_wordlists:WARNING: 'Lovely' (freq. 0.0004751346715821114%, cumulative freq. 95.99453809330109%)
2022-08-09 14:44:11.147 crate_anon.anonymise.fetch_wordlists:WARNING: 'Memory' (freq. 0.00035206985743989073%, cumulative freq. 96.63703713890283%)
2022-08-09 14:44:11.195 crate_anon.anonymise.fetch_wordlists:WARNING: 'Man' (freq. 7.923748646421284e-05%, cumulative freq. 98.88421660873644%)
2022-08-09 14:44:11.217 crate_anon.anonymise.fetch_wordlists:WARNING: 'Friend' (freq. 2.1478293034255495e-05%, cumulative freq. 99.66194356777439%)
2022-08-09 14:44:11.223 crate_anon.anonymise.fetch_wordlists:WARNING: 'Kitten' (freq. 1.5673348970943197e-05%, cumulative freq. 99.75322950589111%)
2022-08-09 14:44:11.241 crate_anon.anonymise.fetch_wordlists:WARNING: 'Joyful' (freq. 5.514696860146681e-06%, cumulative freq. 99.91643957169987%)
2022-08-09 14:44:11.242 crate_anon.anonymise.fetch_wordlists:WARNING: 'No' (freq. 5.224449656981067e-06%, cumulative freq. 99.91929415294076%)
2022-08-09 14:44:11.242 crate_anon.anonymise.fetch_wordlists:WARNING: 'Powers' (freq. 4.934202453815452e-06%, cumulative freq. 99.9216228062494%)
2022-08-09 14:44:11.243 crate_anon.anonymise.fetch_wordlists:WARNING: 'Power' (freq. 4.934202453815452e-06%, cumulative freq. 99.92721819182891%)
2022-08-09 14:44:11.249 crate_anon.anonymise.fetch_wordlists:WARNING: 'Music' (freq. 3.4829664379873775e-06%, cumulative freq. 99.9491315654163%)
2022-08-09 14:44:11.251 crate_anon.anonymise.fetch_wordlists:WARNING: 'Games' (freq. 2.9024720316561478e-06%, cumulative freq. 99.95915293059741%)
2022-08-09 14:44:11.253 crate_anon.anonymise.fetch_wordlists:WARNING: 'You' (freq. 2.9024720316561478e-06%, cumulative freq. 99.96296677884928%)

but surnames are trickier:


**2022-08-09 14:44:15.419 crate_anon.anonymise.fetch_wordlists:WARNING: 'YOUNG' (freq. 0.193%, cumulative freq. 10.09%)**
2022-08-09 14:44:15.420 crate_anon.anonymise.fetch_wordlists:WARNING: 'POWERS' (freq. 0.035%, cumulative freq. 29.991%)
2022-08-09 14:44:15.423 crate_anon.anonymise.fetch_wordlists:WARNING: 'JOHN' (freq. 0.01%, cumulative freq. 45.311%)
**2022-08-09 14:44:15.426 crate_anon.anonymise.fetch_wordlists:WARNING: 'HOPE' (freq. 0.007%, cumulative freq. 50.632%)**
**2022-08-09 14:44:15.426 crate_anon.anonymise.fetch_wordlists:WARNING: 'FRIEND' (freq. 0.007%, cumulative freq. 51.511%)**
2022-08-09 14:44:15.429 crate_anon.anonymise.fetch_wordlists:WARNING: 'POWER' (freq. 0.005%, cumulative freq. 55.573%)
**2022-08-09 14:44:15.433 crate_anon.anonymise.fetch_wordlists:WARNING: 'TRUE' (freq. 0.003%, cumulative freq. 59.0%)**
2022-08-09 14:44:15.443 crate_anon.anonymise.fetch_wordlists:WARNING: 'LOVELY' (freq. 0.002%, cumulative freq. 65.123%)
2022-08-09 14:44:15.445 crate_anon.anonymise.fetch_wordlists:WARNING: 'MUSIC' (freq. 0.002%, cumulative freq. 65.959%)
**2022-08-09 14:44:15.465 crate_anon.anonymise.fetch_wordlists:WARNING: 'HUSBAND' (freq. 0.001%, cumulative freq. 72.016%)**
**2022-08-09 14:44:15.466 crate_anon.anonymise.fetch_wordlists:WARNING: 'HE' (freq. 0.001%, cumulative freq. 72.177%)**
2022-08-09 14:44:15.473 crate_anon.anonymise.fetch_wordlists:WARNING: 'GAMES' (freq. 0.001%, cumulative freq. 73.548%)
**2022-08-09 14:44:15.476 crate_anon.anonymise.fetch_wordlists:WARNING: 'MAN' (freq. 0.001%, cumulative freq. 74.197%)**
**2022-08-09 14:44:15.480 crate_anon.anonymise.fetch_wordlists:WARNING: 'YOU' (freq. 0.001%, cumulative freq. 74.69%)**
**2022-08-09 14:44:15.530 crate_anon.anonymise.fetch_wordlists:WARNING: 'NO' (freq. 0.0%, cumulative freq. 80.171%)**
**2022-08-09 14:44:15.530 crate_anon.anonymise.fetch_wordlists:WARNING: 'MOOD' (freq. 0.0%, cumulative freq. 80.178%)**
2022-08-09 14:44:15.560 crate_anon.anonymise.fetch_wordlists:WARNING: 'FOUGHT' (freq. 0.0%, cumulative freq. 82.225%)
**2022-08-09 14:44:15.585 crate_anon.anonymise.fetch_wordlists:WARNING: 'MEMORY' (freq. 0.0%, cumulative freq. 83.535%)**
2022-08-09 14:44:15.597 crate_anon.anonymise.fetch_wordlists:WARNING: 'LIBEL' (freq. 0.0%, cumulative freq. 84.12%)
2022-08-09 14:44:15.598 crate_anon.anonymise.fetch_wordlists:WARNING: 'KITTEN' (freq. 0.0%, cumulative freq. 84.158%)
2022-08-09 14:44:15.751 crate_anon.anonymise.fetch_wordlists:WARNING: 'ZONE' (freq. 0.0%, cumulative freq. 89.234%)

It's awkward! There's not really a good substitute for having the names in the structured data. Some of these would be problematic to remove (inc. for our NLP tools).
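
To make the cumulative frequency idea concrete, here is a minimal sketch of a cutoff applied to a name-to-frequency mapping. It assumes names are ranked from most to least common and that "cumulative frequency" is the running total of those percentages; the function name `names_within_cumulative_cutoff` and the figures are invented for illustration and may not match how crate_fetch_wordlists implements its option.

```python
from typing import Dict, List


def names_within_cumulative_cutoff(
    name_freq_pct: Dict[str, float], max_cumulative_pct: float
) -> List[str]:
    """Keep the most common names until the running total passes the cutoff."""
    kept: List[str] = []
    cumulative = 0.0
    # Most frequent first; ties broken alphabetically for a stable order.
    ranked = sorted(name_freq_pct.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, freq in ranked:
        cumulative += freq
        if cumulative > max_cumulative_pct:
            break  # this name and all rarer ones fall outside the cutoff
        kept.append(name)
    return kept


# Invented percentages, for illustration only (not taken from the logs above).
forename_freq_pct = {"John": 60.0, "Veronica": 30.0, "Hope": 8.0, "Kitten": 2.0}
kept = names_within_cumulative_cutoff(forename_freq_pct, max_cumulative_pct=99.0)
```

A cutoff of this kind only drops rare names. In the surname log above, 'YOUNG' has a cumulative frequency of 10.09%, so no plausible cutoff would exclude it while still keeping common surnames.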

RudolfCardinal commented 2 years ago

Perhaps I've been thinking about frequencies in the wrong place -- likely we need "all names" minus "common English words that are not proper nouns". I'll look at that.
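
One way to picture "all names" minus "common English words that are not proper nouns" is sketched below. It is only an illustration and is not the approach taken in the pull request: it assumes a hypothetical common-word list that preserves case, and treats all-lowercase entries as ordinary words and capitalised entries as proper nouns. The function name `names_to_scrub` and the sample data are invented.

```python
from typing import Iterable, List


def names_to_scrub(all_names: Iterable[str], common_words: Iterable[str]) -> List[str]:
    """Remove from the name list only those common words that are not proper nouns."""
    # Assumption: lowercase entries are ordinary words; capitalised entries
    # (e.g. "John") are proper nouns and must stay on the name list.
    ordinary_words = {w for w in common_words if w.islower()}
    unique_names = {n.strip().lower() for n in all_names}
    return sorted(unique_names - ordinary_words)


# Hypothetical sample data, for illustration only.
all_names = ["John", "Veronica", "Hope", "Young", "Parkinson"]
common_words = ["hope", "young", "John", "the", "he"]

scrub_list = names_to_scrub(all_names, common_words)
```

Under these assumptions "john" and "veronica" stay on the scrub list, while "hope" and "young" are left in the text for downstream tools.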

RudolfCardinal commented 2 years ago

My attempt at https://github.com/ucam-department-of-psychiatry/crate/pull/109