ucam-department-of-psychiatry / crate

Create and use de-identified research databases. Preprocess, extract text, anonymise/de-identify, link, apply natural language processing, query for research, manage consent for contact.
GNU General Public License v3.0
19 stars 7 forks source link

BRC GATE Pharmacotherapy: SQL insert fails for Invalid dose value #40

Closed martinburchell closed 3 years ago

martinburchell commented 3 years ago

UUIDs can result in some odd results from brc-gate-pharmacotherapy:


54A14C20-50FB-11EB-AE9C-9999999999C
STOP
2021-01-07 15:34:58.747 [INFO|CrateGatePipeline] Text read from stdin
2021-01-07 15:34:58.747 [DEBUG|CrateGatePipeline] >>>>>>>>>>>>>>>>> CONTENTS OF STDIN:
2021-01-07 15:34:58.748 [DEBUG|CrateGatePipeline] 54A14C20-50FB-11EB-AE9C-9999999999C

2021-01-07 15:34:58.748 [DEBUG|CrateGatePipeline] <<<<<<<<<<<<<<<<<
2021-01-07 15:34:58.748 [INFO|CrateGatePipeline] Running application...
2021-01-07 15:34:58.755 [INFO|CrateGatePipeline] Application complete, processing output...
2021-01-07 15:34:58.755 [DEBUG|CrateGatePipeline] Excluding set as not included: ""
2021-01-07 15:34:58.755 [DEBUG|CrateGatePipeline] Excluding set as not included: "Original markups"
2021-01-07 15:34:58.755 [DEBUG|CrateGatePipeline] Explicitly including set: "Output"
dose-unit       C       _start  22      _type   Prescription    drug-type       GATE_gazetteer  dose-multiple   1       dose-value      -9.999999999E9  _set    Output  drug    C       _end    35      dose    -9999999999C       route   unknown _id     44      _content        C-9999999999C   tense   present status  continuing
2021-01-07 15:34:58.756 [INFO|CrateGatePipeline] Found annotation of type: Prescription
2021-01-07 15:34:58.756 [DEBUG|CrateGatePipeline] >>>>>>>>>>>>>>>>> ANNOTATION:
2021-01-07 15:34:58.756 [DEBUG|CrateGatePipeline] _content:C-9999999999C
2021-01-07 15:34:58.756 [DEBUG|CrateGatePipeline] _end:35
2021-01-07 15:34:58.756 [DEBUG|CrateGatePipeline] _id:44
2021-01-07 15:34:58.756 [DEBUG|CrateGatePipeline] _set:Output
2021-01-07 15:34:58.756 [DEBUG|CrateGatePipeline] _start:22
2021-01-07 15:34:58.756 [DEBUG|CrateGatePipeline] _type:Prescription
2021-01-07 15:34:58.756 [DEBUG|CrateGatePipeline] dose:-9999999999C
2021-01-07 15:34:58.757 [DEBUG|CrateGatePipeline] dose-multiple:1
2021-01-07 15:34:58.757 [DEBUG|CrateGatePipeline] dose-unit:C
2021-01-07 15:34:58.757 [DEBUG|CrateGatePipeline] dose-value:-9.999999999E9
2021-01-07 15:34:58.757 [DEBUG|CrateGatePipeline] drug:C
2021-01-07 15:34:58.757 [DEBUG|CrateGatePipeline] drug-type:GATE_gazetteer
2021-01-07 15:34:58.757 [DEBUG|CrateGatePipeline] route:unknown
2021-01-07 15:34:58.757 [DEBUG|CrateGatePipeline] status:continuing
2021-01-07 15:34:58.757 [DEBUG|CrateGatePipeline] tense:present
2021-01-07 15:34:58.757 [DEBUG|CrateGatePipeline] <<<<<<<<<<<<<<<<<

or:

C-12345E678901C
STOP
2021-01-07 15:38:06.379 [INFO|CrateGatePipeline] Text read from stdin
2021-01-07 15:38:06.380 [DEBUG|CrateGatePipeline] >>>>>>>>>>>>>>>>> CONTENTS OF STDIN:
2021-01-07 15:38:06.380 [DEBUG|CrateGatePipeline] C-12345E678901C

2021-01-07 15:38:06.380 [DEBUG|CrateGatePipeline] <<<<<<<<<<<<<<<<<
2021-01-07 15:38:06.380 [INFO|CrateGatePipeline] Running application...
2021-01-07 15:38:06.389 [INFO|CrateGatePipeline] Application complete, processing output...
2021-01-07 15:38:06.389 [DEBUG|CrateGatePipeline] Excluding set as not included: ""
2021-01-07 15:38:06.390 [DEBUG|CrateGatePipeline] Excluding set as not included: "Original markups"
2021-01-07 15:38:06.390 [DEBUG|CrateGatePipeline] Explicitly including set: "Output"
dose-unit       C       _start  0       _type   Prescription    drug-type       GATE_gazetteer  dose-multiple   1       dose-value      -Infinity       _set    Output  drug    C       _end    15      dose    -12345E678901C     route   unknown _id     20      _content        C-12345E678901C tense   present status  continuing
2021-01-07 15:38:06.391 [INFO|CrateGatePipeline] Found annotation of type: Prescription
2021-01-07 15:38:06.391 [DEBUG|CrateGatePipeline] >>>>>>>>>>>>>>>>> ANNOTATION:
2021-01-07 15:38:06.391 [DEBUG|CrateGatePipeline] _content:C-12345E678901C
2021-01-07 15:38:06.391 [DEBUG|CrateGatePipeline] _end:15
2021-01-07 15:38:06.392 [DEBUG|CrateGatePipeline] _id:20
2021-01-07 15:38:06.392 [DEBUG|CrateGatePipeline] _set:Output
2021-01-07 15:38:06.392 [DEBUG|CrateGatePipeline] _start:0
2021-01-07 15:38:06.392 [DEBUG|CrateGatePipeline] _type:Prescription
2021-01-07 15:38:06.392 [DEBUG|CrateGatePipeline] dose:-12345E678901C
2021-01-07 15:38:06.392 [DEBUG|CrateGatePipeline] dose-multiple:1
2021-01-07 15:38:06.392 [DEBUG|CrateGatePipeline] dose-unit:C
2021-01-07 15:38:06.392 [DEBUG|CrateGatePipeline] dose-value:-Infinity
2021-01-07 15:38:06.392 [DEBUG|CrateGatePipeline] drug:C
2021-01-07 15:38:06.393 [DEBUG|CrateGatePipeline] drug-type:GATE_gazetteer
2021-01-07 15:38:06.393 [DEBUG|CrateGatePipeline] route:unknown
2021-01-07 15:38:06.393 [DEBUG|CrateGatePipeline] status:continuing
2021-01-07 15:38:06.393 [DEBUG|CrateGatePipeline] tense:present
2021-01-07 15:38:06.393 [DEBUG|CrateGatePipeline] <<<<<<<<<<<<<<<<<

We then run into problems when inserting the results because "-9.999999999E9" and "-Infinity" aren't valid:

"Error converting data type nvarchar to numeric." when the dose-value column in the database is defined as decimal.

It appears that 'C' is both a drug and a unit and anything in between is interpreted as a quantity. If there's an E in there, it gets treated as an exponent. Hence the odd results,

martinburchell commented 3 years ago

Also raised an issue with the plugin https://github.com/KHP-Informatics/brc-gate-pharmacotherapy/issues/1

martinburchell commented 3 years ago

Workaround is to define the database column to be varchar so we don't get the SQL errors

RudolfCardinal commented 3 years ago

Do you think we should have the Python code check if its destination column is numeric (via SQLAlchemy), and if so, if the quantity is either non-numeric or non-finite, either (a) replace it with NULL and log a warning ("Rejecting value XXX from column YYY as it is not numeric/finite; replacing it with NULL") or (b) chuck the row and log a corresponding warning? (The downside to VARCHAR is that people may want to do dose calculations.)

martinburchell commented 3 years ago

As we discussed, let's catch the SQL error and ignore the row if the insert fails, logging the error.