healtex / texscrubber

Personal information de-identification tool
Apache License 2.0

PatientNEWriter component #5

Closed mbelousov closed 7 years ago

mbelousov commented 7 years ago

Appends the named entities extracted from a GATEDocument to the set of patient-specific vocabulary in the database.

hkkenneth commented 7 years ago

I am quite sceptical about whether we can output one gazetteer per patient because Spring-boot processes the input document by document.

I don't have a concrete plan yet, just a few ideas up for discussion.

  1. The processor for each document appends to the per-patient output files (potential race condition?)

  2. Make use of listener: for each document, update an in-memory per-patient NE list. When the processing is done, listener writes the aggregated list in a per-patient way.

  3. Use another step to combine the "per document gazetteer" into "per patient gazetteer".

If we support multi-threading, it may be more complicated.

And what about the discussion on keeping the gazetteer in memory?

hkkenneth commented 7 years ago

4th idea:

Have a Java Bean maintain this data structure, which the processor for each document can append to:

{
 "pat1": {
   "cat1": ["str1", "str2"],
   "cat2": ["str1", "str2"]
 },
 "pat2": {
   "cat1": ["str1", "str4"],
   "cat2": ["str1", "str3"]
 }
}

Then I assume this Java Bean can survive between 1st pass processing and 2nd pass processing.

Basically this Bean is just an in-memory database...

Assumption: we can populate the GATE gazetteer from memory instead of from a file. If files must be used, we need to output the NE lists at the right time.
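A minimal sketch of such a bean, for discussion only. The class name `PatientNeStore` and its methods are hypothetical, and it uses sets rather than the lists shown in the JSON above so that repeated strings are stored once. It relies on concurrent collections on the assumption that several documents may be processed at the same time.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

/** Hypothetical in-memory store: patient id -> category -> distinct strings. */
final class PatientNeStore {
    private final Map<String, Map<String, Set<String>>> entities = new ConcurrentHashMap<>();

    /** Records one named-entity string for a patient and category; safe to call from several threads. */
    void add(String patientId, String category, String text) {
        entities.computeIfAbsent(patientId, p -> new ConcurrentHashMap<>())
                .computeIfAbsent(category, c -> new ConcurrentSkipListSet<>())
                .add(text);
    }

    /** Returns a sorted copy of one patient's entries, or an empty map for an unknown patient. */
    Map<String, Set<String>> forPatient(String patientId) {
        Map<String, Set<String>> copy = new TreeMap<>();
        Map<String, Set<String>> stored =
                entities.getOrDefault(patientId, Collections.<String, Set<String>>emptyMap());
        stored.forEach((category, strings) -> copy.put(category, new TreeSet<>(strings)));
        return copy;
    }
}
```

If this were registered as a singleton bean, both passes could share the same instance; whether that fits the actual job configuration is untested.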

dehghana commented 7 years ago

I think we resolved this; i.e., solution:

Make use of listener: for each document, update an in-memory per-patient NE list. When the processing is done, listener writes the aggregated list in a per-patient way.

owendw1 commented 7 years ago

1) In terms of the listener mentioned, are we saying that PatientNEWriter should implement one (such as a GATE DocumentListener)?

2) Is PatientNEWriter still meant to receive a GATEDocument, or does it now instead access the in-memory per-patient NE list (wherever that may live)? And then do we want it to just write the list contents to file in some useful format?

dehghana commented 7 years ago

@spoodlepowered wrt Q2, see the solution we have found: [https://gate.ac.uk/releases/gate-5.1-beta1-build3397-ALL/doc/tao/splitch13.html#x18-30000013.7] I haven't tried the plugin but this has been recommended by GD for our problem.

hkkenneth commented 7 years ago

Make use of listener: for each document, update an in-memory per-patient NE list. When the processing is done, listener writes the aggregated list in a per-patient way.

To be more specific, it breaks down to:

  1. For each processed document, its writer writes only the annotation to an in-memory storage and nothing to the file system.
  2. After such "write", a listener should be triggered to check if all documents for this patient have been processed.
  3. If all documents have been processed, it flushes the in-memory storage for this patient to a file output.
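
The three steps above could be sketched roughly as follows. This is plain Java with hypothetical names (`PatientFlushTracker`, `documentWritten`), and it assumes the number of documents per patient is known before processing starts, which has not been confirmed in this thread. The `flusher` callback stands in for whatever writes the per-patient file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical tracker: buffers entities per patient and flushes once all documents are seen. */
final class PatientFlushTracker {
    private final Map<String, Integer> remainingDocs = new HashMap<>();
    private final Map<String, List<String>> buffered = new HashMap<>();
    private final BiConsumer<String, List<String>> flusher;

    /** expectedDocsPerPatient is an assumption: the document count per patient must be known up front. */
    PatientFlushTracker(Map<String, Integer> expectedDocsPerPatient,
                        BiConsumer<String, List<String>> flusher) {
        this.remainingDocs.putAll(expectedDocsPerPatient);
        this.flusher = flusher;
    }

    /** Buffers one document's entities (step 1), checks completion (step 2), flushes if done (step 3). */
    synchronized boolean documentWritten(String patientId, List<String> entities) {
        Integer remaining = remainingDocs.get(patientId);
        if (remaining == null) {
            throw new IllegalStateException("Unexpected or already flushed patient: " + patientId);
        }
        buffered.computeIfAbsent(patientId, p -> new ArrayList<>()).addAll(entities);
        if (remaining > 1) {
            remainingDocs.put(patientId, remaining - 1);
            return false; // more documents are still expected for this patient
        }
        remainingDocs.remove(patientId);
        flusher.accept(patientId, buffered.remove(patientId));
        return true; // the last document arrived and the buffer was flushed
    }
}
```

The method is synchronized so that the count check and the flush happen as one unit if documents are written from several threads.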

However, this means the first pass cannot fail on a per-document basis (as decided last week), because every document writer has to succeed in order to trigger the listener. Are there any potential problems with this abstraction?

hkkenneth commented 7 years ago

In terms of the listener mentioned, are we saying that PatientNEWriter should implement one (such as a GATE DocumentListener)?

I think we can extend a class that implements this interface: http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/core/ItemWriteListener.html
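
To make the shape of such a listener concrete, here is a rough sketch. To keep it self-contained it declares a local stand-in interface, `WriteListener`, whose three callbacks mirror the List-based signatures of Spring Batch's `ItemWriteListener` as I understand them; it is not the real interface. `ProcessedDoc` and `PatientNeListener` are hypothetical names as well.

```java
import java.util.ArrayList;
import java.util.List;

/** Local stand-in modelled on Spring Batch's ItemWriteListener; not the real interface. */
interface WriteListener<S> {
    void beforeWrite(List<? extends S> items);
    void afterWrite(List<? extends S> items);
    void onWriteError(Exception exception, List<? extends S> items);
}

/** Hypothetical item type: the first-pass result for one document. */
final class ProcessedDoc {
    final String patientId;
    final List<String> entities;

    ProcessedDoc(String patientId, List<String> entities) {
        this.patientId = patientId;
        this.entities = entities;
    }
}

/** Hypothetical listener that records which patients had successful and failed writes. */
final class PatientNeListener implements WriteListener<ProcessedDoc> {
    final List<String> written = new ArrayList<>();
    final List<String> failed = new ArrayList<>();

    @Override
    public void beforeWrite(List<? extends ProcessedDoc> items) {
        // Nothing to prepare in this sketch.
    }

    @Override
    public void afterWrite(List<? extends ProcessedDoc> items) {
        // Called after a successful write: the place to check whether a patient is complete.
        for (ProcessedDoc doc : items) {
            written.add(doc.patientId);
        }
    }

    @Override
    public void onWriteError(Exception exception, List<? extends ProcessedDoc> items) {
        // Called when the write fails: lets a failed document be noted instead of going unseen.
        for (ProcessedDoc doc : items) {
            failed.add(doc.patientId);
        }
    }
}
```

Whether the error callback is enough to handle a per-document failure in the first pass is an open question; the sketch only shows where that information could be captured.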

Is PatientNEWriter still meant to receive a GATEDocument, or does it now instead access the in-memory per-patient NE list (wherever that may live)? And then do we want it to just write the list contents to file in some useful format?

I don't know yet. I think we have to get the first-pass processor working before we can try it. If the first-pass processor returns a GATEDocument, then PatientNEWriter should take that and update the in-memory data storage.

mbelousov commented 7 years ago

See #19