Closed mbelousov closed 7 years ago
I am quite sceptical about whether we can output one gazetteer per patient because Spring-boot processes the input document by document.
I don't have a concrete plan yet, just a few ideas up for discussion.
The processor for each document appends to the per-patient output files (potential race condition?)
Make use of a listener: for each document, update an in-memory per-patient NE list. When processing is done, the listener writes out the aggregated list for each patient.
Use another step to combine the "per document gazetteer" into "per patient gazetteer".
If we support multi-threading, it may be more complicated.
And the discussion about saving the gazetteer in-memory?
4th idea:
Have a Java Bean maintain this data structure, which the processor for each document can append to:
{
"pat1": {
"cat1": ["str1", "str2"],
"cat2": ["str1", "str2"]
},
"pat2": {
"cat1": ["str1", "str4"],
"cat2": ["str1", "str3"]
}
}
Then I assume this Java Bean can survive between 1st pass processing and 2nd pass processing.
Basically this Bean is just an in-memory database...
Assumption: we can populate the GATE gazetteer from memory instead of from file. If files must be used, we need to output the NE lists at the right time.
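To make the 4th idea a bit more concrete, here is a minimal sketch of such a bean in plain Java. It is only an illustration under assumptions: the class name `PatientVocabularyStore` and its methods are hypothetical, and it mirrors the patient -> category -> strings structure above using concurrent collections so that several document processors could append at the same time.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory store: patient id -> category -> distinct strings.
final class PatientVocabularyStore {
    private final Map<String, Map<String, Set<String>>> byPatient = new ConcurrentHashMap<>();

    // Appends one term; intended to be safe to call from several processors at once.
    void add(String patientId, String category, String term) {
        byPatient
            .computeIfAbsent(patientId, p -> new ConcurrentHashMap<>())
            .computeIfAbsent(category, c -> ConcurrentHashMap.newKeySet())
            .add(term);
    }

    // Sorted copy of one patient's vocabulary, e.g. as input for the second pass.
    Map<String, Set<String>> snapshot(String patientId) {
        Map<String, Set<String>> copy = new TreeMap<>();
        Map<String, Set<String>> categories =
            byPatient.getOrDefault(patientId, Collections.emptyMap());
        categories.forEach((category, terms) -> copy.put(category, new TreeSet<>(terms)));
        return copy;
    }

    // Sorted copy of all patient ids seen so far.
    Set<String> patientIds() {
        return new TreeSet<>(byPatient.keySet());
    }
}
```

If this were registered as a singleton bean, the same instance would be shared by the first and second pass, which is the "in-memory database" behaviour described above. Whether GATE can consume it directly instead of a file is still the open assumption.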
I think we resolved this; i.e., solution:
Make use of a listener: for each document, update an in-memory per-patient NE list. When processing is done, the listener writes out the aggregated list for each patient.
1) In terms of the listener mentioned, are we saying that PatientNEWriter should implement one (such as a GATE DocumentListener)?
2) Is PatientNEWriter still meant to receive a GATEDocument, or does it now instead access the in-memory per-patient NE list (wherever that may live)? And then do we want it to just write the list contents to file in some useful format?
@spoodlepowered wrt Q2, see the solution we have found: [https://gate.ac.uk/releases/gate-5.1-beta1-build3397-ALL/doc/tao/splitch13.html#x18-30000013.7] I haven't tried the plugin but this has been recommended by GD for our problem.
Make use of a listener: for each document, update an in-memory per-patient NE list. When processing is done, the listener writes out the aggregated list for each patient.
To be more specific, it breaks down to:
However, we cannot fail the first pass on a per-document basis (as decided last week), because each document writer has to succeed in order to trigger the listener. Are there any potential problems with this abstraction?
In terms of the listener mentioned, are we saying that PatientNEWriter should implement one (such as a GATE DocumentListener)?
I think we can extend an implementing class of this interface: http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/core/ItemWriteListener.html
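As a rough sketch of the listener idea, the class below is a plain-Java stand-in and not a real Spring Batch listener: `PatientNEListener`, `DocumentEntities`, the `.lst` file naming, and the output directory are all hypothetical. In an actual job, `afterWrite` would presumably come from `ItemWriteListener`, and the final write would need some end-of-step hook (for example `StepExecutionListener`), which is an assumption here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical item type: the named entities extracted from one document.
final class DocumentEntities {
    final String patientId;
    final List<String> entities;

    DocumentEntities(String patientId, List<String> entities) {
        this.patientId = patientId;
        this.entities = entities;
    }
}

// Plain-Java stand-in for the listener; it does not implement any Spring interface.
final class PatientNEListener {
    private final Map<String, Set<String>> entitiesByPatient = new ConcurrentHashMap<>();
    private final Path outputDir;

    PatientNEListener(Path outputDir) {
        this.outputDir = outputDir;
    }

    // Called after a batch of documents was written: update the per-patient NE list.
    void afterWrite(List<DocumentEntities> items) {
        for (DocumentEntities doc : items) {
            entitiesByPatient
                .computeIfAbsent(doc.patientId, p -> ConcurrentHashMap.newKeySet())
                .addAll(doc.entities);
        }
    }

    // Called once when processing is done: write one sorted list file per patient.
    void afterStep() throws IOException {
        Files.createDirectories(outputDir);
        for (Map.Entry<String, Set<String>> entry : entitiesByPatient.entrySet()) {
            Set<String> sorted = new TreeSet<>(entry.getValue());
            Path file = outputDir.resolve(entry.getKey() + ".lst");
            Files.write(file, sorted, StandardCharsets.UTF_8);
        }
    }
}
```

Note that this sketch only aggregates documents whose write succeeded, which is the same constraint raised above about not being able to fail the first pass per document.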
Is PatientNEWriter still meant to receive a GATEDocument, or does it now instead access the in-memory per-patient NE list (wherever that may live)? And then do we want it to just write the list contents to file in some useful format?
I don't know yet; I think we have to get the first-pass processor working before we can try it. If the first-pass processor returns a GATEDocument, then PatientNEWriter should take that and update the in-memory data storage.
See #19
Appends extracted named entities from GATEDocument to set of patient-specific vocabulary in database