ucam-department-of-psychiatry / crate

Create and use de-identified research databases. Preprocess, extract text, anonymise/de-identify, link, apply natural language processing, query for research, manage consent for contact.
GNU General Public License v3.0
19 stars 7 forks source link

TypeError with boolean opt out field and '1' in optout_col_values #140

Closed martinburchell closed 4 months ago

martinburchell commented 5 months ago

If optout_col_values = [True, 1, '1'] in the anonymisation configuration, and the opt-out field is a boolean field, when executing the following:

SELECT DISTINCT crate_rio_number FROM rio_manual_opt_out WHERE opt_out IN (True, 1, '1')

SQL Alchemy will raise a TypeError when checking if the string values in the IN clause could be considered boolean.

There's a failing test on the fix-optout-col-values-typeerror branch, which despite its name doesn't actually fix anything:

2024-01-23T11:10:24.1316248Z _ GenOptOutPidsFromDatabaseTests.test_string_in_optout_col_values_ignored_for_boolean_column _
2024-01-23T11:10:24.1318387Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py:1810: in _execute_context
2024-01-23T11:10:24.1319585Z     context = constructor(
2024-01-23T11:10:24.1320934Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py:1077: in _init_compiled
2024-01-23T11:10:24.1322335Z     param = [
2024-01-23T11:10:24.1326311Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py:1078: in <listcomp>
2024-01-23T11:10:24.1327892Z     processors[key](compiled_params[key])
2024-01-23T11:10:24.1329522Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/sql/sqltypes.py:2000: in process
2024-01-23T11:10:24.1330629Z     value = _strict_as_bool(value)
2024-01-23T11:10:24.1331818Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/sql/sqltypes.py:1975: in _strict_as_bool
2024-01-23T11:10:24.1333092Z     raise TypeError("Not a boolean value: %r" % (value,))
2024-01-23T11:10:24.1334296Z E   TypeError: Not a boolean value: '1'
2024-01-23T11:10:24.1334743Z 
2024-01-23T11:10:24.1335221Z The above exception was the direct cause of the following exception:
2024-01-23T11:10:24.1336775Z crate_anon/anonymise/tests/anonymise_tests.py:91: in test_string_in_optout_col_values_ignored_for_boolean_column
2024-01-23T11:10:24.1338076Z     pids = list(gen_opt_out_pids_from_database())
2024-01-23T11:10:24.1339324Z crate_anon/anonymise/anonymise.py:1806: in gen_opt_out_pids_from_database
2024-01-23T11:10:24.1340336Z     result = session.execute(query)
2024-01-23T11:10:24.1341438Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py:1717: in execute
2024-01-23T11:10:24.1342670Z     result = conn._execute_20(statement, params or {}, execution_options)
2024-01-23T11:10:24.1344123Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py:1710: in _execute_20
2024-01-23T11:10:24.1345315Z     return meth(self, args_10style, kwargs_10style, execution_options)
2024-01-23T11:10:24.1346786Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py:334: in _execute_on_connection
2024-01-23T11:10:24.1348062Z     return connection._execute_clauseelement(
2024-01-23T11:10:24.1349515Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py:1577: in _execute_clauseelement
2024-01-23T11:10:24.1350690Z     ret = self._execute_context(
2024-01-23T11:10:24.1351911Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py:1816: in _execute_context
2024-01-23T11:10:24.1352957Z     self._handle_dbapi_exception(
2024-01-23T11:10:24.1354162Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py:2134: in _handle_dbapi_exception
2024-01-23T11:10:24.1355326Z     util.raise_(
2024-01-23T11:10:24.1356230Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py:211: in raise_
2024-01-23T11:10:24.1357184Z     raise exception
2024-01-23T11:10:24.1358304Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py:1810: in _execute_context
2024-01-23T11:10:24.1359390Z     context = constructor(
2024-01-23T11:10:24.1360448Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py:1077: in _init_compiled
2024-01-23T11:10:24.1361632Z     param = [
2024-01-23T11:10:24.1362656Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py:1078: in <listcomp>
2024-01-23T11:10:24.1363687Z     processors[key](compiled_params[key])
2024-01-23T11:10:24.1364939Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/sql/sqltypes.py:2000: in process
2024-01-23T11:10:24.1365947Z     value = _strict_as_bool(value)
2024-01-23T11:10:24.1367047Z ../../../venv/lib/python3.8/site-packages/sqlalchemy/sql/sqltypes.py:1975: in _strict_as_bool
2024-01-23T11:10:24.1368324Z     raise TypeError("Not a boolean value: %r" % (value,))
2024-01-23T11:10:24.1369538Z E   sqlalchemy.exc.StatementError: (builtins.TypeError) Not a boolean value: '1'
2024-01-23T11:10:24.1370499Z E   [SQL: SELECT DISTINCT pid 
2024-01-23T11:10:24.1371209Z E   FROM test_opt_out 
2024-01-23T11:10:24.1371848Z E   WHERE opt_out IN (__[POSTCOMPILE_opt_out_1])]
2024-01-23T11:10:24.1372541Z E   [parameters: [{}]]

Possible options:

  1. Do nothing. It's user error, though not particularly helpful to find out several hours into anonymisation.
  2. Validate the config file and abort early with more helpful error message.
  3. Do some clever preprocessing so only sensible values of optout_col_values are considered for boolean fields

I suppose it's conceivable that there may be a boolean opt-out field in one table, where True or 1 means opt-out and a string opt-out field in another table, where '1' or 'yes' means opt-out.

@RudolfCardinal any thoughts?

RudolfCardinal commented 5 months ago

It seems a bit over-conservative on SQLAlchemy's part. Presumably it reflects the column type from the database before this point? The conservatism includes e.g.

There's some utility in being able to specify the opt-outs quite broadly in the config file, I think, as for the example you suggest with multiple tables operating in different ways -- although maybe it's not huge. Is it possible to disable the SQLAlchemy check for this specific command?

In SQLAlchemy, I think this is in sql/sqltypes.py, specifically in class Boolean. So is that from reflection?

martinburchell commented 5 months ago

The rationale behind this behaviour is explained in https://docs.sqlalchemy.org/en/14/changelog/migration_12.html#boolean-datatype-now-enforces-strict-true-false-none-values.

To disable the checks we would need something like the LiberalBoolean TypeDecorator example in that note. Alternatively if there is a risk that '0' may be interpreted differently by different database backends then maybe we should just filter out any string values from optout_col_values when the column is boolean.