Closed by martinburchell 2 years ago
Possibly... We'd either have to define these in the data dictionary (patient data to be scrubbed / third-party data to be scrubbed / new specific other form of data to be scrubbed) or across the DD and config file. I think the reasons to be careful are:
string_max_regex_errors = 2
[__SURNAME__] a song of sixpence, a pocket full of [__ADDRESS__];
Four and twenty blackbirds, baked in a pie;
When the pie was opened, the birds began to [__SURNAME__];
Wasn't that a [__FORENAME__] dish to set before a [__SURNAME__]!
... gives you forename, surname, town with fairly high confidence.
Speed. An additional regex is required per substitution.
Complexity, e.g. replace_patient_info_with, replace_third_party_info_with, and replace_nonspecific_info_with (plus scrub_src in the data dictionary). You'd need additional replacement options either directly within the data dictionary (potentially leading to inconsistencies across a large DD) or a mapping in the DD to replacement options defined in the config.
So: maybe, but I think there'd need to be a good reason. Much of the information is contextual anyway, e.g. "I saw Mr [___PPP___] today" is probably a name. Has anyone come up with an NLP problem that can't cope with the current setup but could cope with the more specific version?
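To make the speed and complexity trade-off concrete, here is a rough Python sketch. It is not CRATE's scrubber: the function names, the example terms, and the assumption that all terms can be folded into one combined pattern are mine, purely for illustration. It contrasts a single generic replacement token with category-specific tokens, where each category needs its own regex pass.

```python
import re

# Hypothetical patient details, for illustration only.
TERMS_BY_CATEGORY = {
    "FORENAME": ["Alice"],
    "SURNAME": ["Smith"],
    "ADDRESS": ["Cambridge"],
}


def _alternation(terms):
    """Build a whole-word, case-insensitive regex matching any of the terms."""
    escaped = "|".join(re.escape(t) for t in terms)
    return re.compile(rf"\b(?:{escaped})\b", re.IGNORECASE)


def scrub_generic(text, terms_by_category, token="[__PPP__]"):
    """One combined regex and one replacement token for every term."""
    all_terms = [t for terms in terms_by_category.values() for t in terms]
    if not all_terms:
        return text
    return _alternation(all_terms).sub(token, text)


def scrub_specific(text, terms_by_category):
    """One regex pass per category, each with its own replacement token."""
    for category, terms in terms_by_category.items():
        if not terms:
            continue  # an empty alternation would match at every word boundary
        text = _alternation(terms).sub(f"[__{category}__]", text)
    return text


text = "I saw Mr Smith today. Alice Smith lives in Cambridge."
print(scrub_generic(text, TERMS_BY_CATEGORY))
print(scrub_specific(text, TERMS_BY_CATEGORY))
```

In this sketch the generic version compiles one pattern however many categories exist, while the specific version compiles and applies one pattern per category, which is where the extra cost and the extra configuration would come from.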
As discussed 19/4/22 -- no specific use case as yet. Therefore: park/close for now. Can always revisit if an NLP requirement arises for this feature.
Feedback from today's Turing meeting after I demoed the anonymisation API:
Could we customise the scrubbed text so that it reads something like
[__FORENAME__] [__SURNAME__] was born on [__DOB__] and lives at [__ADDRESS__]
instead of [__PPP__] throughout? The thinking was it would be easier for both humans and machines to understand the scrubbed text.
Is there a risk of making the original text easier to guess if there are too many clues?
I can imagine for the API we could pass in the replacement text along with the terms to be searched.
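As a rough illustration of what passing the replacement text alongside the terms might look like, here is a minimal Python sketch. The payload shape (`text`, `groups`, `terms`, `replace_with`) and the function name `scrub_request` are hypothetical and are not the existing anonymisation API; groups without their own replacement fall back to the generic token.

```python
import re

DEFAULT_REPLACEMENT = "[__PPP__]"


def scrub_request(payload):
    """Scrub payload["text"], using each group's own replacement text.

    Hypothetical payload shape:
        {"text": str,
         "groups": [{"terms": [str, ...], "replace_with": str}, ...]}
    "replace_with" is optional and defaults to DEFAULT_REPLACEMENT.
    """
    text = payload["text"]
    for group in payload.get("groups", []):
        terms = [t for t in group.get("terms", []) if t]
        if not terms:
            continue
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
            re.IGNORECASE,
        )
        replacement = group.get("replace_with", DEFAULT_REPLACEMENT)
        # A function replacement keeps any backslashes in the token literal.
        text = pattern.sub(lambda _match, r=replacement: r, text)
    return text


example = {
    "text": "Alice Smith was born on 1 Jan 1980 and lives at 1 High Street.",
    "groups": [
        {"terms": ["Alice"], "replace_with": "[__FORENAME__]"},
        {"terms": ["Smith"], "replace_with": "[__SURNAME__]"},
        {"terms": ["1 Jan 1980"], "replace_with": "[__DOB__]"},
        {"terms": ["1 High Street"], "replace_with": "[__ADDRESS__]"},
    ],
}
print(scrub_request(example))
```

Keeping the replacement next to its terms means the caller decides how specific the tokens are, so the reidentification question raised above becomes a per-request choice rather than a global setting.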