mbelousov opened this issue 7 years ago
Regarding 3, it was already implemented (`feature/ScrubberProcessor`) by @dehghana; it just needs to be integrated.
In `NamedEntitiesWriter.write()`, what's the purpose of the "Clear person gazetteers" part? Do you want to make sure files generated in the previous run are erased?
This is a problem when one patient has many documents.
E.g. in the first call to `write()`, 2 documents for patient number 5 are passed in and their NEs are written. Then in the second call to `write()`, 3 other documents for patient number 5 are passed in. The second 3 documents will overwrite the NEs found in the first 2.
Also, "Append to person gazetteers" may write duplicate named entities into the `.lst` file - possibly not a breaking problem, but it may be inefficient.
A way to reduce I/O could be: in `NamedEntitiesWriter.write()`, consolidate the named entities in a per-patient fashion (e.g. this gives us a `Map<String, List<...>>` keyed by patient ID) and then write each patient's gazetteer once - see the sketch below.
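A minimal sketch of what I mean, assuming `AnnotatedDocument` exposes hypothetical `getPatientId()` / `getNamedEntities()` accessors and an `appendToGazetteer()` helper (names are illustrative, not the actual API):

```java
import java.util.*;

// Sketch only: consolidate NEs per patient before doing any file I/O,
// so each gazetteer file is touched once per chunk instead of once per document.
public void write(List<? extends AnnotatedDocument> items) {
    Map<String, Set<String>> perPatient = new HashMap<>();
    for (AnnotatedDocument doc : items) {
        perPatient
            .computeIfAbsent(doc.getPatientId(), id -> new LinkedHashSet<>())
            .addAll(doc.getNamedEntities());
    }
    // One append per patient instead of one write per document.
    perPatient.forEach((patientId, entities) -> appendToGazetteer(patientId, entities));
}
```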
The rest seems to work quite well.
But I don't think `DeidentifiedDocumentWriter.write()` is the right place to "Remove all patients gazetteers".
First, it may delete the `.lst` and `.def` files prematurely (e.g. in the above example, once the first 2 documents for patient number 5 have been written out by `DeidentifiedDocumentWriter`, their `.lst` and `.def` get deleted and `SecondPassItemProcessor.process()` cannot find them when processing the second batch of 3 documents). Worst of all, even if the gazetteer files are not found, GATE silently ignores it.
Second, it may interfere with the retry logic in Spring Batch.
I think a better place to do such cleanup is a listener after the 2nd pass (inherently we would need to keep track of how many documents each patient has and count how many have been processed / written) - something like the sketch below?
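Roughly what I have in mind, as a sketch only (the gazetteer output directory and the listener wiring are assumptions, not the actual config):

```java
import java.io.IOException;
import java.nio.file.*;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

// Sketch: remove the per-patient gazetteer files once the whole job has finished,
// instead of deleting them from DeidentifiedDocumentWriter.write().
public class GazetteerCleanupListener implements JobExecutionListener {

    private final Path gazetteerDir; // assumed configurable output folder

    public GazetteerCleanupListener(Path gazetteerDir) {
        this.gazetteerDir = gazetteerDir;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // nothing to do before the job
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(gazetteerDir, "*.{lst,def}")) {
            for (Path file : files) {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            // cleanup failure should not fail the job; log and move on
            e.printStackTrace();
        }
    }
}
```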
Thanks @mbelousov, I don't have further comments.
One last long-term question: can the writer be concurrent when we support multi-threading? Will two threads attempt to write to the same patient's gazetteer file at the same time?
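If that turns out to be the case, one option might be a per-patient lock so writes to the same gazetteer file are serialised while different patients can still proceed in parallel. A minimal sketch (method and class names are illustrative only):

```java
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: serialise gazetteer writes per patient under multi-threading.
public class PatientGazetteerLocks {

    private final ConcurrentHashMap<String, Object> locks = new ConcurrentHashMap<>();

    public void writeEntities(String patientId, Collection<String> entities) {
        Object lock = locks.computeIfAbsent(patientId, id -> new Object());
        synchronized (lock) {
            // placeholder for the actual file-append logic
            appendToGazetteer(patientId, entities);
        }
    }

    private void appendToGazetteer(String patientId, Collection<String> entities) {
        // ... actual I/O omitted in this sketch ...
    }
}
```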
@hkkenneth, we have committed changes to fix the logic and for issue #9. It would be awesome if you could have a look into the workflow and add a job listener to clean the output folders. Feel free to create a separate issue for that, but if you have any concerns regarding the logic of the current workflow, let's discuss it here. Thanks!
Not urgent - for the next version we may take a look at whether duplicate entries in the `.lst` files are bad for performance. Compare that with the approach of keeping a set of "written" NEs per patient throughout the entire first pass until all gazetteer files are written (a sketch of that is below)?
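For reference, that dedup approach could look something like this - a sketch only, trading a bit of memory for `.lst` files without duplicate lines:

```java
import java.util.*;

// Sketch: remember which NEs were already written for each patient during the first pass,
// so the .lst files never accumulate duplicate entries.
public class WrittenEntitiesTracker {

    private final Map<String, Set<String>> writtenPerPatient = new HashMap<>();

    /** Returns only the entities that have not been written for this patient yet. */
    public List<String> filterNew(String patientId, Collection<String> candidates) {
        Set<String> written =
            writtenPerPatient.computeIfAbsent(patientId, id -> new HashSet<>());
        List<String> fresh = new ArrayList<>();
        for (String entity : candidates) {
            if (written.add(entity)) { // add() returns false if it was already there
                fresh.add(entity);
            }
        }
        return fresh;
    }
}
```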
Hey guys,
Please have a look into the `feature/ner-integration` branch. I have integrated NER (from `feature/NamedEntityExtractor`) into our `development` branch and implemented the majority of the stuff that we discussed during the call.
I've decided to keep a separate data structure for annotated documents (so I renamed `GATEDocument` into `AnnotatedDocument`, which has `gate.Document` as a property). Ideally it should extend `gate.Document` rather than have it as a property.
It turned out that we need two files per patient (both `.lst` and `.def`), but if you know how to minimise the number of files, please let me know.
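For anyone not familiar with the GATE gazetteer layout: the `.def` file is the index the gazetteer is pointed at, and each of its lines maps a `.lst` file to an annotation type, while the `.lst` file holds one entry per line. So per patient it ends up looking roughly like this (the file names, type and entries here are only an example):

```
patient_5.def   (one line per list, format list_file:majorType)
  patient_5.lst:person

patient_5.lst   (one named entity per line)
  John Smith
  J. Smith
```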
Currently it writes the gateDoc XML into the result txt files, so what we need to do is implement `getScrubbedContent` properly for `AnnotatedDocument`.
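In case it helps, here is a rough sketch of how `getScrubbedContent` could work, assuming we take the `gate.Document` held by `AnnotatedDocument` and replace every annotation of the configured PII types with a placeholder (the parameter names, placeholder format and type names are assumptions, not the real API):

```java
import java.util.*;
import gate.Annotation;
import gate.AnnotationSet;
import gate.Document;

// Sketch: build the scrubbed text by replacing annotated PII spans with placeholders.
public String getScrubbedContent(Document gateDoc, Set<String> piiTypes) {
    String text = gateDoc.getContent().toString();
    AnnotationSet annotations = gateDoc.getAnnotations().get(piiTypes);

    // Sort by offset so the text can be rebuilt left to right.
    List<Annotation> sorted = new ArrayList<>(annotations);
    sorted.sort(Comparator.comparing(a -> a.getStartNode().getOffset()));

    StringBuilder scrubbed = new StringBuilder();
    long cursor = 0;
    for (Annotation ann : sorted) {
        long start = ann.getStartNode().getOffset();
        long end = ann.getEndNode().getOffset();
        if (start < cursor) {
            continue; // skip overlapping annotations in this simple sketch
        }
        scrubbed.append(text, (int) cursor, (int) start);
        scrubbed.append("XXX-").append(ann.getType()).append("-XXX");
        cursor = end;
    }
    scrubbed.append(text.substring((int) cursor));
    return scrubbed.toString();
}
```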
There is a bunch of stuff that we could try to optimise in order to keep I/O operations to a minimum; I have ignored this for now, but feel free to contribute.
Before we can merge it into `development`, let's test it and discuss all remaining issues.