Initial implementation for spring batch with two steps

hkkenneth commented 7 years ago

Points that need clarification:

What's the variable type expected for:
- list of named entities (step1)
- vocabAnnotator (step2)
- scrubberProcessor (step2) (ref: https://trello.com/b/lVXZsSel/two-pass-de-identification)
What's the file output of step1?
- 1 output file per person, or
- 1 output file per input file
Document skipping logic (i.e. error handling)
- do we skip at person level or document level if a document fails in the first pass?

How to run?

Install Gradle
Create an input folder with some text files for testing
Create two empty output folders (one for pass1, one for pass2)
Update the file paths in the source code (searching for /Users/kennethlui/workspace/texscrubber will find them)
Build with Gradle
Run with java -jar build/libs/texscrubber-0.1.0.jar

hkkenneth commented 7 years ago

list of named entities (step1)

list of strings, divided by category: Map<String, List<String>>
{ "category1": ["str1", "str2"], "category2": ["str1", "str2"] }

vocabAnnotator (step2)

Create a GATE Gazetteer?

scrubberProcessor (step2)

?
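
A minimal sketch of the category-keyed structure proposed above for the step1 entity list. The class name NamedEntityLists and its methods are hypothetical and not part of texscrubber; the element type is assumed to be String.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for the step1 result; names are illustrative only.
final class NamedEntityLists {
    // category -> strings collected for that category, in insertion order
    private final Map<String, List<String>> byCategory = new LinkedHashMap<>();

    // Appends one entity string to its category, creating the list on first use.
    void add(String category, String text) {
        byCategory.computeIfAbsent(category, k -> new ArrayList<>()).add(text);
    }

    Map<String, List<String>> asMap() {
        return byCategory;
    }

    public static void main(String[] args) {
        NamedEntityLists entities = new NamedEntityLists();
        entities.add("category1", "str1");
        entities.add("category1", "str2");
        entities.add("category2", "str1");
        entities.add("category2", "str2");
        // prints {category1=[str1, str2], category2=[str1, str2]}
        System.out.println(entities.asMap());
    }
}
```

A List keeps duplicates; if repeated mentions should collapse, a Set per category would be the alternative.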

What's the file output of step1?

1 output file per person in GATE gazetteer format (*.lst). Need a temporary folder for each patient.
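
A rough sketch of writing gazetteer list files into a per-patient temporary folder. It assumes one .lst file per category with one entry per line (the plain GATE list format). GazetteerListWriter and writeLists are hypothetical names, and the definition file that a GATE gazetteer normally uses to index its lists is not generated here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of texscrubber or GATE.
final class GazetteerListWriter {
    // Creates a fresh temporary folder for the patient and writes
    // <category>.lst into it, one entry per line. Returns the folder.
    static Path writeLists(String patientId, Map<String, List<String>> byCategory)
            throws IOException {
        Path dir = Files.createTempDirectory("gazetteer-" + patientId + "-");
        for (Map.Entry<String, List<String>> entry : byCategory.entrySet()) {
            Path lst = dir.resolve(entry.getKey() + ".lst");
            Files.write(lst, entry.getValue(), StandardCharsets.UTF_8);
        }
        return dir;
    }
}
```

The caller is responsible for deleting the folder once pass 2 has finished with the patient.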

Note: not all NERs will have two passes

document skipping logic (i.e. error handling)

do we skip at person level or document level if the document failed in the first pass?

At document level.
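
To illustrate what skipping at document level means for the first pass, here is a plain-Java sketch (deliberately not using the Spring Batch API). DocumentLevelSkip and firstPass are hypothetical names, and a failure is assumed to surface as a RuntimeException from the annotator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of document-level skipping.
final class DocumentLevelSkip {
    // Runs the first pass over all documents of one patient. A document
    // whose annotation throws is recorded in `skipped` and left out of the
    // results; the patient's remaining documents are still processed.
    static <D, R> List<R> firstPass(List<D> documents,
                                    Function<D, R> annotate,
                                    List<D> skipped) {
        List<R> results = new ArrayList<>();
        for (D doc : documents) {
            try {
                results.add(annotate.apply(doc));
            } catch (RuntimeException e) {
                skipped.add(doc); // skip this document only, not the patient
            }
        }
        return results;
    }
}
```

In Spring Batch itself the same effect is usually configured on a fault-tolerant step with a skip rule and skip limit, rather than hand-written like this.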

mbelousov commented 7 years ago

More notes:

For each patient there will be a set of GATE gazetteers (one per NE category).
The output of step 1 should be a set of GATE gazetteers for each patient (in memory) plus annotated GATE documents [#4] (since not all NERs will have two passes). Alternatively, we could populate a Map [shared object] (per patient, per category) during document processing and then load its contents into a GATE Gazetteer. (This needs clarifying.)
Skip at document level.
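
One possible shape for the shared per-patient, per-category Map mentioned above, sketched under the assumption that documents may be processed concurrently and that duplicate strings should collapse. SharedEntityStore and its methods are hypothetical names; loading the result into a GATE Gazetteer is not shown.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical shared object filled while documents are processed.
final class SharedEntityStore {
    // patientId -> category -> distinct entity strings
    private final Map<String, Map<String, Set<String>>> store = new ConcurrentHashMap<>();

    // Thread-safe insert; nested maps and sets are created on first use.
    void record(String patientId, String category, String text) {
        store.computeIfAbsent(patientId, p -> new ConcurrentHashMap<>())
             .computeIfAbsent(category, c -> ConcurrentHashMap.newKeySet())
             .add(text);
    }

    // Returns the strings recorded for a patient and category,
    // or an empty set when nothing was recorded.
    Set<String> entries(String patientId, String category) {
        Map<String, Set<String>> perPatient =
                store.getOrDefault(patientId, Collections.emptyMap());
        return perPatient.getOrDefault(category, Collections.emptySet());
    }
}
```

Keying by patient first keeps each patient's lists together, so they can be turned into that patient's gazetteers once step 1 has seen all of their documents.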

healtex / texscrubber

Initial implementation for spring batch with two steps #15