healtex / texscrubber

Personal information de-identification tool
Apache License 2.0

Initial integration of NER into Spring Batch #24

Open mbelousov opened 7 years ago

mbelousov commented 7 years ago

Hey guys,

Please have a look at the feature/ner-integration branch. I have integrated NER (from feature/NamedEntityExtractor) into our development branch and implemented the majority of what we discussed during the call.

  1. I've decided to keep a separate data structure for the annotated document (so I renamed GATEDocument to AnnotatedDocument, which has gate.Document as a property). Ideally it should extend gate.Document rather than hold it as a property. A rough sketch follows after this list.

  2. It turned out that we need two files per patient (a .lst and a .def); if you know how to minimise the number of files, please let me know.

  3. Currently it writes the gateDoc XML into the result .txt files, so we still need to implement getScrubbedContent properly for AnnotatedDocument.

  4. There is a bunch of stuff we could try to optimise in order to keep I/O operations to a minimum; I have ignored this for now, but feel free to contribute.
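
Regarding point 1, the shape of the wrapper is roughly as follows (a simplified sketch, not the exact code in the branch; the constructor and accessor names here are illustrative, and getScrubbedContent is still the placeholder mentioned in point 3):

```java
import gate.Document;

/**
 * Sketch: an annotated document that holds the underlying gate.Document
 * as a property. Ideally this would extend/implement gate.Document instead.
 */
public class AnnotatedDocument {

    private final Document gateDocument;

    public AnnotatedDocument(Document gateDocument) {
        this.gateDocument = gateDocument;
    }

    public Document getGateDocument() {
        return gateDocument;
    }

    /**
     * Point 3: this should return the de-identified text rather than the
     * GATE XML serialisation of the whole document.
     */
    public String getScrubbedContent() {
        // Placeholder: the real implementation comes from feature/ScrubberProcessor.
        return gateDocument.getContent().toString();
    }
}
```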

Before we merge it into development, let's test it and discuss the remaining issues.

mbelousov commented 7 years ago

Regarding point 3: it was already implemented by @dehghana (feature/ScrubberProcessor) and just needs to be integrated.

hkkenneth commented 7 years ago

In NamedEntitiesWriter.write(), what's the purpose of the "Clear person gazetteers" part? Is it meant to make sure that files generated in a previous run are erased?

This causes a problem when one patient has many documents. E.g. in the first call to write(), 2 documents for patient number 5 are passed in and their NEs are written. Then in the second call to write(), 3 other documents for patient number 5 are passed in, and the NEs from those 3 documents overwrite the NEs found in the first 2.

Also, the "Append to person gazetteers" part may write duplicate named entities into the .lst file; probably not a breaking problem, but potentially inefficient.

A way to reduce I/O could be (a sketch follows the list):

  1. For the list of annotated documents passed to NamedEntitiesWriter.write(), consolidate their named entities per patient (this gives us a Map<String, List>, i.e. Map<patient ID, NEs>).
  2. For each patient ID, read the existing ID.lst and append only the NEs that are not already present in the file.
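
A rough sketch of what I mean (the class and method names here are just illustrative, not existing code):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GazetteerAppender {

    /** Minimal stand-in for one annotated document's NER output. */
    public static class DocumentEntities {
        final String patientId;
        final List<String> namedEntities;
        public DocumentEntities(String patientId, List<String> namedEntities) {
            this.patientId = patientId;
            this.namedEntities = namedEntities;
        }
    }

    /** Step 1: consolidate NEs per patient, i.e. Map<patient ID, NEs>. */
    public static Map<String, Set<String>> consolidate(List<DocumentEntities> docs) {
        Map<String, Set<String>> perPatient = new HashMap<>();
        for (DocumentEntities doc : docs) {
            perPatient.computeIfAbsent(doc.patientId, id -> new LinkedHashSet<>())
                      .addAll(doc.namedEntities);
        }
        return perPatient;
    }

    /** Step 2: append to each ID.lst only the NEs not already present in the file. */
    public static void appendMissing(Path gazetteerDir,
                                     Map<String, Set<String>> perPatient) throws IOException {
        for (Map.Entry<String, Set<String>> e : perPatient.entrySet()) {
            Path lst = gazetteerDir.resolve(e.getKey() + ".lst");
            Set<String> existing = Files.exists(lst)
                    ? new LinkedHashSet<>(Files.readAllLines(lst, StandardCharsets.UTF_8))
                    : new LinkedHashSet<>();
            Set<String> missing = new LinkedHashSet<>(e.getValue());
            missing.removeAll(existing);
            if (!missing.isEmpty()) {
                Files.write(lst, missing, StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }
}
```

This way each ID.lst is read and written at most once per call to write(), regardless of how many documents the chunk contains.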

The rest seems to work quite well.

hkkenneth commented 7 years ago

But I don't think DeidentifiedDocumentWriter.write() is the right place to "Remove all patients gazetteers".

First, it may delete the .lst and .def files prematurely. E.g. in the example above, once the first 2 documents for patient number 5 have been written out by DeidentifiedDocumentWriter, their .lst and .def are deleted, and SecondPassItemProcessor.process() cannot find them when processing the second batch of 3 documents. (Worse still, if the gazetteer files are not found, GATE ignores it silently.)

Second, it may interfere with the retry logic in Spring Batch.

I think a better place for such clean-up is a listener after the 2nd pass (though then we inherently need to keep track of how many documents each patient has and count how many have been processed / written).
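
Something along these lines, as a sketch (the class name and wiring are made up; it assumes we can attach a StepExecutionListener to the 2nd-pass step and that all per-patient gazetteer files live in one folder):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;

/**
 * Sketch: remove the per-patient gazetteer files (*.lst and *.def) only after
 * the 2nd pass has finished, instead of inside DeidentifiedDocumentWriter.write().
 */
public class GazetteerCleanupListener implements StepExecutionListener {

    private final Path gazetteerDir;

    public GazetteerCleanupListener(Path gazetteerDir) {
        this.gazetteerDir = gazetteerDir;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // Nothing to do before the 2nd pass.
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        // Only clean up if the whole step succeeded, so retries still see the files.
        if (ExitStatus.COMPLETED.getExitCode()
                .equals(stepExecution.getExitStatus().getExitCode())) {
            try (DirectoryStream<Path> files =
                         Files.newDirectoryStream(gazetteerDir, "*.{lst,def}")) {
                for (Path file : files) {
                    Files.deleteIfExists(file);
                }
            } catch (IOException e) {
                throw new IllegalStateException("Could not clean gazetteer files", e);
            }
        }
        return stepExecution.getExitStatus();
    }
}
```

It would be registered on the 2nd-pass step via the step builder's listener(...) method, so the deletion happens exactly once per run rather than per chunk.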

hkkenneth commented 7 years ago

Thanks @mbelousov , I don't have further comments.

One last long-term question: can the writer be made concurrent when we support multi-threading? Could two threads attempt to write to the same patient's gazetteer file at the same time?
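
If the writer stays file-based, one simple way to guard against that could be a per-patient lock; a sketch (not existing code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Sketch: serialise writes to the same patient's gazetteer file while still
 * letting documents for different patients be written concurrently.
 */
public class PatientFileLocks {

    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<>();

    public void withPatientLock(String patientId, Runnable writeAction) {
        // One lock object per patient ID, created lazily and reused.
        Object lock = locks.computeIfAbsent(patientId, id -> new Object());
        synchronized (lock) {
            writeAction.run();
        }
    }
}
```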

mbelousov commented 7 years ago

@hkkenneth, we have committed changes to fix the logic and to address issue #9. It would be awesome if you could have a look at the workflow and add a job listener to clean the output folders. Feel free to create a separate issue for that, but if you have any concerns about the logic of the current workflow, let's discuss them here. Thanks!

hkkenneth commented 7 years ago

Not urgent: for the next version we could look at whether duplicate entries in the .lst files hurt performance, and compare that with the approach of keeping a set of "written" NEs per patient throughout the entire firstPass, until all gazetteer files are written.
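
That alternative could look roughly like this (illustrative names only, not existing code): the tracker remembers which NEs have already been written for each patient, so the .lst files never accumulate duplicates and never need to be re-read.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch: keep an in-memory set of "written" NEs per patient for the whole
 * first pass, and only hand back the entries that still need to be appended.
 */
public class WrittenEntityTracker {

    private final Map<String, Set<String>> written = new ConcurrentHashMap<>();

    /** Returns only the NEs that have not yet been written for this patient. */
    public List<String> markNew(String patientId, Collection<String> namedEntities) {
        Set<String> seen = written.computeIfAbsent(patientId,
                id -> ConcurrentHashMap.newKeySet());
        List<String> fresh = new ArrayList<>();
        for (String ne : namedEntities) {
            if (seen.add(ne)) {   // add() returns false if the NE was written before
                fresh.add(ne);
            }
        }
        return fresh;
    }
}
```

The trade-off is memory: the sets live for the whole first pass, which should be fine for per-patient person names but is worth keeping in mind for very large corpora.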