SystmOne data dictionary

Command line:
- Split out standalone commands, as the crate_anonymise command was becoming confusingly multi-purpose:
  - crate_anonymise --count becomes crate_anon_show_counts;
  - crate_anonymise --democonfig becomes crate_anon_demo_config;
  - crate_anonymise --checkextractor becomes crate_anon_check_text_extractor;
  - crate_anonymise --draftdd and crate_anonymise --incrementaldd become crate_anon_draft_dd.
- New crate_anon_summarize_dd tool.
- Change some hyphens to underscores in the command-line arguments to the PCMIS and RiO preprocessing tools, for consistency.
Help:
- Add an index of all CRATE commands.
Data dictionaries and automatic data dictionary generation:
- Full support for data dictionaries in CSV, ODS, and XLSX format, as well as the existing TSV. (Uses the first spreadsheet of a potentially multi-sheet file when reading.)
- Support SystmOne data dictionaries.
- ddgen_force_lower_case default changed from True to False.
- ddgen_min_length_for_scrubbing default changed from 0 to 50.
- New ddgen_freetext_index_min_length option.
- Fulltext indexing during data dictionary autogeneration now bases its decisions on the source (not destination) datatype. This handles the "auto-expansion" better -- otherwise all sorts of things were attracting the full-text flag.
- Remove warnings about lack of primary PID field in source tables with an MPID if no scrubbing is required (that's an inconvenience, not a de-identification risk).
- Use DataDictionary.get_pid_name instead of ddgen_per_table_pid_field to establish the PID field for each table for scrubbing. The ddgen options should only be for generating a data dictionary; the user may have revised the data dictionary subsequently, and there is no requirement that all PID fields have the same name across tables.
- Add data dictionary check that all scrub-source tables have a patient ID field.
- Remove ddgen_allow_no_patient_info option and replace it with allow_no_patient_info -- this is now a "runtime" setting, not a "data dictionary definition" setting. Depending on allow_no_patient_info, warnings or errors are produced if a data dictionary is used without patient-defining information (which is usually wrong, but there are sometimes sensible use-cases for it).
- Allow ddgen_min_length_for_scrubbing to be less than 1, which disables scrubbing entirely (helpful for the SystmOne automatic data dictionary generation).
- Add data dictionary row check that "add source hash" (H) flag fields are not omitted, as promised in the documentation.
- Autodetect primary keys.
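
The two length thresholds in this list can be read as simple predicates on the declared length of the source column. The following is a minimal sketch of that reading, not CRATE's implementation: the function names, the parameter names, and the treatment of unbounded text columns are assumptions made for illustration. Only the option names ddgen_min_length_for_scrubbing and ddgen_freetext_index_min_length come from the list above.

```python
from typing import Optional


def should_scrub(source_length: Optional[int],
                 min_length_for_scrubbing: int) -> bool:
    """
    Decide whether a source text column would be flagged for scrubbing.

    source_length: declared length of the source column, or None for an
        unbounded text type (ASSUMPTION: treated as "long enough").
    min_length_for_scrubbing: stands in for ddgen_min_length_for_scrubbing;
        a value below 1 disables scrubbing entirely.
    """
    if min_length_for_scrubbing < 1:
        return False  # scrubbing switched off for the whole dictionary
    if source_length is None:
        return True  # unbounded text column (assumption, see docstring)
    return source_length >= min_length_for_scrubbing


def should_fulltext_index(source_length: Optional[int],
                          freetext_index_min_length: int) -> bool:
    """
    Decide whether a column would attract the full-text index flag.

    The decision uses the SOURCE column length, not the (possibly
    auto-expanded) destination length. freetext_index_min_length stands
    in for ddgen_freetext_index_min_length.
    """
    if source_length is None:
        return True  # unbounded text column (assumption)
    return source_length >= freetext_index_min_length
```

With a threshold of 50, a 100-character source column would be scrubbed and a 20-character one would not; with a threshold of 0, nothing is scrubbed.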
Anonymisation:
- New scrub method: phrase_unless_numeric.
- Efficiency check when recursing into third-party records, to avoid doing the same work twice.
- Automatically hash third-party PIDs using the same hasher as patient PIDs, rendering the de-identified records linkable (if and only if the third-party PID field is marked for inclusion).
- New denylist_files_as_phrases and denylist_phrases_flexible_whitespace options for anonymisation.
- Fix crate_anon.anonymise.scrub.WordList to use its suffixes parameter even if regex_method is False. (Was not being used.)
- Ensure that if MRIDs are being added automatically (option add_mrid_wherever_rid_added), but a table has an explicit MPID/MRID data dictionary row with the same name, we don't attempt to add it twice.
- Make primary key columns (which are already detected and/or configured by the user) explicitly NOT NULL on the destination, which allows free-text indexing. Replicate source NOT NULL status, allowing the user to control this via a source flag, for other column types.
- Add support for SQL column comments (supported since SQLAlchemy 1.2).
- Drop all tables known to the data dictionary (not just tables with included content), to avoid leaving orphan tables when the data dictionary is altered to OMIT everything in a table. As before, only active tables are created.
- Allow secret table PID/MPID types to be integer despite string source fields, giving a warning only. This is acceptable if the source fields do in fact contain only integers-as-strings, e.g. '123'.
- When dates are truncated, (a) ensure time fields are zero, and (b) default (during data dictionary drafting) to a DATE field, in case the source is DATETIME.
- Fix scrubber order (in crate_anon.anonymise.scrub.PersonalizedScrubber.scrub). Previously: (1) nonspecific, (2) patient, (3) third party. Now: (1) patient, (2) third party, (3) nonspecific. This gives the user more information about the subject of a sentence.
- Option to scrub all dates: scrub_all_dates.
  - This does not presently do generic date "blurring"; blurring to year is very imprecise, while blurring to month is quite susceptible to information discovery around month boundaries. However, if required, this could be implemented -- likely not by a simple textual replacement using named capture groups for the parts to preserve, but by named capture groups followed by date parsing, followed by writing the date in a standard (e.g. ISO) format.
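
The capture, parse, and rewrite approach outlined in the last point might look roughly like the following. This is a sketch of the idea only, not something this PR implements: the regex covers just one numeric day-first layout, and the pattern, the function names, and the choice to blur to the first of the month are all assumptions for illustration. Blurring to month also keeps the month-boundary weakness noted above.

```python
import re
from datetime import date

# Hypothetical pattern: numeric, day-first dates such as 25/12/2019 or
# 25-12-2019. Real free text contains many more date layouts than this.
DATE_RE = re.compile(
    r"\b(?P<day>\d{1,2})[/-](?P<month>\d{1,2})[/-](?P<year>\d{4})\b"
)


def _blur(match: "re.Match") -> str:
    """Parse the captured parts, then write a blurred ISO-format date."""
    try:
        parsed = date(
            int(match["year"]), int(match["month"]), int(match["day"])
        )
    except ValueError:
        # Not a real calendar date (e.g. 31/02/2020). This sketch leaves
        # the text unchanged; a real scrubber might prefer to remove it.
        return match.group(0)
    # Blur to month by forcing the day to 1, then emit ISO 8601 (YYYY-MM-DD).
    return parsed.replace(day=1).isoformat()


def blur_dates_to_month(text: str) -> str:
    """Replace each recognised date in the text with its blurred ISO form."""
    return DATE_RE.sub(_blur, text)
```

Parsing before rewriting means the output format no longer depends on how the date was written in the source text, which a purely textual replacement would not guarantee.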
