ucam-department-of-psychiatry / crate

Create and use de-identified research databases. Preprocess, extract text, anonymise/de-identify, link, apply natural language processing, query for research, manage consent for contact.
GNU General Public License v3.0

Ability to customise the scrubbed text replacement #82

Closed martinburchell closed 2 years ago

martinburchell commented 2 years ago

Feedback from today's Turing meeting after I demoed the anonymisation API:

Could we customise the scrubbed text so that it reads something like "[__FORENAME__] [__SURNAME__] was born on [__DOB__] and lives at [__ADDRESS__]" instead of using [__PPP__] throughout? The thinking was that category-specific markers would make the scrubbed text easier for both humans and machines to understand.

Is there a risk of making the original text easier to guess if there are too many clues?

I can imagine for the API we could pass in the replacement text along with the terms to be searched.
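
As a rough illustration of that idea (not CRATE's actual API; the function name `scrub_with_labels` and the term-to-label mapping are hypothetical), a minimal Python sketch might look like this:

```python
import re


def scrub_with_labels(text, replacements):
    """Replace each term with its own label.

    `replacements` maps a term to scrub onto the label that replaces it,
    e.g. {"Alice": "[__FORENAME__]"}. Hypothetical helper for illustration.
    """
    for term, label in replacements.items():
        # Whole-word, case-insensitive match on the literal term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        # A function replacement avoids any backslash interpretation in the label.
        text = pattern.sub(lambda m, label=label: label, text)
    return text


scrubbed = scrub_with_labels(
    "Alice Smith was born on 1 Jan 1980 and lives at 5 High Street.",
    {
        "Alice": "[__FORENAME__]",
        "Smith": "[__SURNAME__]",
        "1 Jan 1980": "[__DOB__]",
        "5 High Street": "[__ADDRESS__]",
    },
)
print(scrubbed)
```

Note that this runs one regex pass per term, and that a term matching part of an already-inserted label would be replaced again; a real implementation would need to guard against both.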

RudolfCardinal commented 2 years ago

Possibly... We'd either have to define these in the data dictionary (patient data to be scrubbed / third-party data to be scrubbed / new specific other form of data to be scrubbed) or across the DD and config file. I think the reasons to be careful are:

  1. Identifiability. These problems aren't absent as things stand, but they might become more obvious (and human-obviousness is an important test):
string_max_regex_errors = 2

[__SURNAME__] a song of sixpence, a pocket full of [__ADDRESS__];
Four and twenty blackbirds, baked in a pie;
When the pie was opened, the birds began to [__SURNAME__];
Wasn't that a [__FORENAME__] dish to set before a [__SURNAME__]!

... gives you forename, surname, town with fairly high confidence.

  2. Speed. An additional regex is required per substitution.

  3. Complexity, e.g.

    • Currently, substitutions are defined in the config, and selected by replace_patient_info_with, replace_third_party_info_with, and replace_nonspecific_info_with (plus scrub_src in the data dictionary). You'd need additional replacement options either directly within the data dictionary (potentially leading to inconsistencies across a large DD) or a mapping in the DD to replacement options defined in the config.
    • Users might ask: why not a similar thing for nonspecific info? Fair, but then multiple things to define (e.g. postcodes, 10-digit numbers, ...).
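
On the speed point, one conceivable way to avoid a separate pass per category is a single combined pattern with one named group per category. The sketch below is an assumption for illustration only, not how CRATE builds its scrubbers: `build_scrubber`, `scrub_one_pass` and the `LABELS` mapping are hypothetical names, and it does exact matching only, because the standard-library `re` module has no approximate matching of the kind that `string_max_regex_errors` implies.

```python
import re

# Hypothetical mapping from category to replacement text, of the kind that
# might live in a config file. Keys must be valid Python identifiers because
# they are used as regex group names.
LABELS = {
    "forename": "[__FORENAME__]",
    "surname": "[__SURNAME__]",
    "address": "[__ADDRESS__]",
}


def build_scrubber(terms_by_category):
    """Compile one pattern with a named group per non-empty category."""
    parts = []
    for category, terms in terms_by_category.items():
        if not terms:
            continue  # an empty alternation would match the empty string
        # Longest terms first, so longer terms win over their own prefixes.
        ordered = sorted(terms, key=len, reverse=True)
        alternatives = "|".join(re.escape(t) for t in ordered)
        parts.append(f"(?P<{category}>{alternatives})")
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)


def scrub_one_pass(text, scrubber, labels):
    """Replace every match with the label of the category that matched."""
    return scrubber.sub(lambda m: labels[m.lastgroup], text)


scrubber = build_scrubber({"forename": ["Alice"], "surname": ["Smith"], "address": []})
result = scrub_one_pass("Alice Smith saw Dr Jones", scrubber, LABELS)
print(result)
```

If the same term appears in two categories, the category listed first wins, which is one of the consistency questions such a scheme would have to settle.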

So: maybe, but I think there'd need to be a good reason. Much of the information is contextual anyway, e.g. in "I saw Mr [__PPP__] today", the scrubbed item is probably a name. Has anyone come up with an NLP problem that can't cope with the current setup but could cope with the more specific version?

RudolfCardinal commented 2 years ago

As discussed 19/4/22 -- no specific use case as yet. Therefore: park/close for now. Can always revisit if an NLP requirement arises for this feature.