Create and use de-identified research databases. Preprocess, extract text, anonymise/de-identify, link, apply natural language processing, query for research, manage consent for contact.
Help:
Index of all CRATE commands.
Data dictionaries and automatic data dictionary generation:
Full support for data dictionaries in CSV, ODS, and XLSX format, as well as
the existing TSV. (Uses the first spreadsheet of a potentially multi-sheet
file when reading.)
Support SystmOne data dictionaries.
ddgen_force_lower_case default changed from True to False.
ddgen_min_length_for_scrubbing default changed from 0 to 50.
New ddgen_freetext_index_min_length option.
Fulltext indexing during data dictionary autogeneration now bases its
decisions on the source (not destination) datatype. This handles the
"auto-expansion" better -- otherwise all sorts of things were attracting
the full-text flag.
Remove warnings about lack of primary PID field in source tables with an
MPID if no scrubbing is required (that's an inconvenience, not a
de-identification risk).
Use DataDictionary.get_pid_name instead of
ddgen_per_table_pid_field to establish the PID field for each table for
scrubbing. The ddgen options should only be for generating a data
dictionary; the user may have revised the data dictionary subsequently, and
there is no requirement that all PID fields have the same name across
tables.
Add data dictionary check that all scrub-source tables have a patient ID
field.
Remove ddgen_allow_no_patient_info option and replace it with
allow_no_patient_info -- this is now a "runtime" setting, not a "data
dictionary definition" setting. Depending on allow_no_patient_info,
warnings or errors are produced if a data dictionary is used without
patient-defining information (which is usually wrong, but there are
sometimes sensible use-cases for it).
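The warning-or-error behaviour can be pictured with a minimal sketch. The function name, its arguments, and the message text are hypothetical; only the allow_no_patient_info setting is taken from the entry above.

```python
import warnings


def check_patient_info(dd_has_patient_info: bool,
                       allow_no_patient_info: bool) -> None:
    """Warn or fail if a data dictionary lacks patient-defining information.

    Hypothetical helper: the real check may differ in detail.
    """
    if dd_has_patient_info:
        return
    message = "Data dictionary contains no patient-defining information"
    if allow_no_patient_info:
        # The user has explicitly opted in, so only warn.
        warnings.warn(message)
    else:
        # Usually a mistake, so stop before any data are processed.
        raise ValueError(message)
```

Making this a runtime setting means the same data dictionary can be used strictly in one run and permissively in another.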
Option for ddgen_min_length_for_scrubbing to be less than 1 to disable
scrubbing entirely (helpful for the SystmOne automatic data dictionary
generation).
Add data dictionary row check that "add source hash" (H) flag fields are
not omitted, as promised in the documentation.
Autodetect primary keys.
Anonymisation:
New scrub method: phrase_unless_numeric.
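As a rough illustration, assuming the method scrubs a value as a whole phrase unless the value is purely numeric, in which case it is skipped (this behaviour is inferred from the name and is not stated above). The helper name, the numeric test, and the "[~~~]" replacement tag are hypothetical.

```python
import re
from typing import Optional

# Assumption: "numeric" means an optionally signed integer or decimal.
NUMERIC_RE = re.compile(r"[+-]?\d+(\.\d+)?")


def build_phrase_scrubber(value: str) -> Optional[re.Pattern]:
    """Return a whole-phrase regex for value, or None if value is numeric."""
    value = value.strip()
    if not value or NUMERIC_RE.fullmatch(value):
        return None  # nothing to scrub for empty or purely numeric values
    body = r"\s+".join(re.escape(word) for word in value.split())
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


scrubber = build_phrase_scrubber("5 Tree Road")
print(scrubber.sub("[~~~]", "Lives at 5 Tree Road."))  # Lives at [~~~].
print(build_phrase_scrubber("5"))  # None
```

Skipping bare numbers avoids removing every "5" in the text when a source field happens to hold only a house number.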
Efficiency check when recursing into third-party records, to avoid doing
the same work twice.
Automatically hash third-party PIDs using the same hasher as patient PIDs,
rendering the de-identified records linkable (if and only if the
third-party PID field is marked for inclusion).
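A minimal sketch of why a shared hasher makes records linkable. HMAC-SHA256 and the key shown here are assumptions for illustration; the hasher actually used depends on configuration.

```python
import hashlib
import hmac


def hash_pid(pid: str, secret_key: bytes) -> str:
    """Return a keyed (HMAC-SHA256) hex digest of an identifier."""
    return hmac.new(secret_key, pid.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key; a real key would live in a protected config file.
SECRET_KEY = b"example-secret-key"

# The same person is a patient in one record and a third party in another.
patient_rid = hash_pid("12345", SECRET_KEY)
third_party_rid = hash_pid("12345", SECRET_KEY)

# Same hasher and key give the same pseudonym, so the records can be linked.
assert patient_rid == third_party_rid

# A different key (a different hasher) would break the link.
assert hash_pid("12345", b"another-key") != patient_rid
```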
denylist_files_as_phrases option for anonymisation, and
denylist_phrases_flexible_whitespace.
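The effect of flexible whitespace can be sketched as follows. The function name and the "[~~~]" replacement tag are hypothetical; the sketch only illustrates the idea of letting any run of whitespace separate the words of a denylist phrase.

```python
import re


def phrase_to_regex(phrase: str, flexible_whitespace: bool) -> re.Pattern:
    """Compile a denylist phrase into a case-insensitive, whole-word regex."""
    words = phrase.split()
    # Flexible: any run of whitespace (spaces, tabs, newlines) between words.
    # Strict: exactly one space between words.
    separator = r"\s+" if flexible_whitespace else " "
    body = separator.join(re.escape(word) for word in words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


text = "Reviewed at Acme\nHealth Centre yesterday."
strict = phrase_to_regex("Acme Health Centre", flexible_whitespace=False)
flexible = phrase_to_regex("Acme Health Centre", flexible_whitespace=True)

print(strict.sub("[~~~]", text))    # unchanged: the line break defeats the match
print(flexible.sub("[~~~]", text))  # Reviewed at [~~~] yesterday.
```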
Fix crate_anon.anonymise.scrub.WordList to use its suffixes
parameter even if regex_method is False. (Was not being used.)
Ensure that if MRIDs are being added automatically (option
add_mrid_wherever_rid_added), but a table has an explicit MPID/MRID
data dictionary row with the same name, we don't attempt to add it
twice.
Make primary key columns (which are already detected and/or configured by
the user) explicitly NOT NULL on the destination, which allows free-text
indexing. Replicate source NOT NULL status, allowing the user to control
this via a source flag, for other column types.
Add support for SQL column comments (supported since SQLAlchemy 1.2).
Drop all tables known to the data dictionary (not just tables with
included content), to avoid leaving orphan tables when the data dictionary
is altered to OMIT everything in a table. As before, only active tables are
created.
Allow secret table PID/MPID types to be integer despite string source
fields, giving a warning only. This is acceptable if the source fields do
in fact contain only integers-as-strings, e.g. '123'.
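A check of the "integers-as-strings" condition might look like the sketch below. The helper name and the exact definition of an integer string (optional sign, ASCII digits, surrounding whitespace ignored) are assumptions.

```python
import re

# Assumption: an optional sign followed by ASCII digits only.
INTEGER_RE = re.compile(r"[+-]?[0-9]+")


def is_integer_string(value: str) -> bool:
    """True if value holds nothing but an integer, e.g. '123'."""
    return INTEGER_RE.fullmatch(value.strip()) is not None


print(is_integer_string("123"))   # True
print(is_integer_string("12A"))   # False
```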
When dates are truncated, (a) ensure time fields are zero, and (b) default
(during data dictionary drafting) to a DATE field, in case the source is
DATETIME.
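Both points can be sketched with the standard library. Truncation to the first day of the month is an assumption made for this example; the function name is hypothetical.

```python
import datetime


def truncate_to_month(value: datetime.datetime) -> datetime.datetime:
    """Truncate to the first of the month, with all time fields zeroed."""
    return value.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


source = datetime.datetime(1980, 7, 23, 14, 35, 10)
truncated = truncate_to_month(source)
print(truncated)         # 1980-07-01 00:00:00
print(truncated.date())  # 1980-07-01, as stored in a DATE column
```

Storing the result as a DATE avoids implying a time-of-day precision that the truncated value no longer has.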
Fix scrubber order (in
crate_anon.anonymise.scrub.PersonalizedScrubber.scrub). Was (1)
nonspecific, (2) patient, (3) third party. Now (1) patient, (2) third
party, (3) nonspecific. This provides some more information to the user
about the subject of a sentence.
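A toy example of why the order matters. The word lists and the replacement tags "[PPP]", "[TTT]" and "[~~~]" are hypothetical; the point is only that whichever scrubber runs first claims a word that appears in more than one list.

```python
import re


def make_scrubber(words, replacement):
    """Return a function replacing any of words (whole-word, any case)."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    return lambda text: pattern.sub(replacement, text)


scrub_patient = make_scrubber(["Smith"], "[PPP]")
scrub_third_party = make_scrubber(["Jones"], "[TTT]")
# A generic denylist (e.g. common surnames) overlaps with the specific lists.
scrub_nonspecific = make_scrubber(["Smith", "Jones", "Brown"], "[~~~]")

OLD_ORDER = [scrub_nonspecific, scrub_patient, scrub_third_party]
NEW_ORDER = [scrub_patient, scrub_third_party, scrub_nonspecific]


def scrub(text, scrubbers):
    """Apply each scrubber in turn."""
    for scrubber in scrubbers:
        text = scrubber(text)
    return text


text = "Mr Smith attended with Mrs Jones and Dr Brown."
print(scrub(text, OLD_ORDER))  # Mr [~~~] attended with Mrs [~~~] and Dr [~~~].
print(scrub(text, NEW_ORDER))  # Mr [PPP] attended with Mrs [TTT] and Dr [~~~].
```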
Option to scrub all dates: scrub_all_dates.
This does not presently do generic date "blurring"; blurring to year is
very imprecise, while blurring to month is quite susceptible to
information discovery around month boundaries. However, if required, this
could be implemented -- likely not by a simple textual replace using
named capture groups for the parts to preserve, but by named capture
groups followed by date parsing followed by date-writing in a standard,
e.g. ISO, format.
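The approach described (named capture groups, then date parsing, then writing in a standard format) might look like the sketch below. This is not implemented behaviour; the regex covers only "day month-name year" dates, and the function name and choice of month-level ISO output are assumptions.

```python
import datetime
import re

_MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
                "july", "august", "september", "october", "november",
                "december"]
# Accept full names and three-letter abbreviations, lower-cased.
MONTHS = {}
for number, name in enumerate(_MONTH_NAMES, start=1):
    MONTHS[name] = number
    MONTHS[name[:3]] = number

DATE_RE = re.compile(
    r"\b(?P<day>\d{1,2})\s+(?P<month>[A-Za-z]{3,9})\s+(?P<year>\d{4})\b"
)


def blur_dates_to_month(text: str) -> str:
    """Rewrite recognised dates as ISO-style YYYY-MM; leave the rest alone."""

    def replace(match: re.Match) -> str:
        month = MONTHS.get(match.group("month").lower())
        if month is None:
            return match.group(0)  # not a month name
        try:
            # Parsing validates the date (e.g. rejects 31 February).
            date = datetime.date(
                int(match.group("year")), month, int(match.group("day"))
            )
        except ValueError:
            return match.group(0)
        return f"{date.year:04d}-{date.month:02d}"

    return DATE_RE.sub(replace, text)


print(blur_dates_to_month("Seen on 3 March 2021 and 31 Feb 2020."))
# Seen on 2021-03 and 31 Feb 2020.
```

Parsing before rewriting means that invalid dates are left untouched and every blurred date comes out in one consistent format, whatever its original style.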
Command line:
Split out standalone commands, as the crate_anonymise command was becoming
confusingly multi-purpose:
crate_anonymise --count becomes crate_anon_show_counts;
crate_anonymise --democonfig becomes crate_anon_demo_config;
crate_anonymise --checkextractor becomes crate_anon_check_text_extractor;
crate_anonymise --draftdd and crate_anonymise --incrementaldd
become crate_anon_draft_dd.
crate_anon_summarize_dd tool.
Change some hyphens to underscores in the command-line arguments to the
PCMIS and RiO preprocessing tools, for consistency.