healtex / texscrubber

Personal information de-identification tool
Apache License 2.0

Initial implementation for spring batch with two steps #15

Closed by hkkenneth 7 years ago

hkkenneth commented 7 years ago

Points that need clarification:

  1. What's the variable type expected for:

    • list of named entity (step1)
    • vocabAnnotator (step2)
    • scrubberProcessor (step2)
  2. What's the file output of step1?

    • 1 output file per person, or
    • 1 output file per input file
  3. Document-skipping logic (i.e. error handling)

    • Do we skip at the person level or the document level if a document fails in the first pass?

How to run?

  1. Install gradle
  2. Create an input folder with some text files for testing
  3. Create two empty output folders (for pass1 and pass2 respectively)
  4. Update the file paths in the source code (searching for /Users/kennethlui/workspace/texscrubber will help you find them)
  5. Build with gradle
  6. Run with java -jar build/libs/texscrubber-0.1.0.jar

hkkenneth commented 7 years ago

list of named entity (step1)

List of strings, divided by category: Map<String, List<String>>, e.g. { "category1": ["str1", "str2"], "category2": ["str1", "str2"] }

vocabAnnotator (step2)

Create a GATE Gazetteer?

scrubberProcessor (step2)

?
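For illustration, the per-category map of strings proposed above for the step1 named-entity list could be built roughly like this. This is only a sketch: the class name `NamedEntityLists` and its methods are hypothetical and not taken from the texscrubber code, and the category names are the placeholders from the comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for the step1 result: entity strings grouped by category.
class NamedEntityLists {
    private final Map<String, List<String>> byCategory = new LinkedHashMap<>();

    // Append one entity string under its category, creating the list on first use.
    void add(String category, String text) {
        byCategory.computeIfAbsent(category, k -> new ArrayList<>()).add(text);
    }

    // Expose the category -> strings map that later steps would consume.
    Map<String, List<String>> asMap() {
        return byCategory;
    }

    public static void main(String[] args) {
        NamedEntityLists entities = new NamedEntityLists();
        entities.add("category1", "str1");
        entities.add("category1", "str2");
        entities.add("category2", "str1");
        System.out.println(entities.asMap());
    }
}
```

A `LinkedHashMap` is used here only so that categories print in insertion order; any `Map` implementation would fit the shape described.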

What's the file output of step1?

One output file per person, in GATE gazetteer format (*.lst). This needs a temporary folder for each patient.
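A minimal sketch of writing such a file, assuming the plain-text gazetteer list layout with one entry per line. The class `GazetteerListWriter`, the method `writeList`, the folder prefix, and the list name `entities` are all hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: writes one gazetteer list file into a patient's folder.
class GazetteerListWriter {
    // Writes entries to <folder>/<listName>.lst, one entry per line, and returns the file path.
    static Path writeList(Path folder, String listName, List<String> entries) throws IOException {
        Files.createDirectories(folder);
        Path file = folder.resolve(listName + ".lst");
        return Files.write(file, entries, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Temporary folder for a single (made-up) patient.
        Path patientFolder = Files.createTempDirectory("patient-0001-");
        Path file = writeList(patientFolder, "entities", Arrays.asList("str1", "str2"));
        System.out.println("Wrote " + file);
    }
}
```

Using `Files.createTempDirectory` keeps each patient's lists isolated; cleaning the folder up after pass2 is left out of the sketch.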

Note: not all NERs will have two passes.

document skipping logic (i.e. error handling)

Do we skip at the person level or the document level if a document fails in the first pass?

At the document level.
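Skipping at the document level could behave roughly as in the plain-Java sketch below. Spring Batch has its own skip support for fault-tolerant steps, so this is not how the job would necessarily be wired; it only illustrates the intended behaviour. `DocumentLevelSkip`, `firstPass`, and the sample documents are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of "skip at document level" for one patient's documents.
class DocumentLevelSkip {
    // Runs the first pass over each document; a failing document is recorded in
    // `skipped` and the remaining documents of the same patient are still processed.
    static <D, R> List<R> firstPass(List<D> documents, Function<D, R> annotate, List<D> skipped) {
        List<R> results = new ArrayList<>();
        for (D doc : documents) {
            try {
                results.add(annotate.apply(doc));
            } catch (RuntimeException e) {
                skipped.add(doc); // skip only this document, not the whole patient
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> skipped = new ArrayList<>();
        List<Integer> lengths = firstPass(
            Arrays.asList("doc one", "", "doc three"),
            text -> {
                if (text.isEmpty()) throw new IllegalArgumentException("empty document");
                return text.length();
            },
            skipped);
        System.out.println(lengths + " skipped=" + skipped);
    }
}
```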

mbelousov commented 7 years ago

More notes:

  1. For each patient there will be a set of GATE gazetteers (one per NE category).
  2. The output of step 1 should be a set of GATE gazetteers per patient (in memory) plus the annotated GATE documents [#4] (since not all NERs will have two passes). Alternatively, we could populate a Map (a shared object, per patient and per category) during document processing and then load its contents into a GATE Gazetteer. (This needs clarifying.)
  3. Skip at the document level.
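The shared per-patient-per-category Map mentioned in note 2 could be sketched as follows, assuming documents may be processed concurrently. `PatientEntityStore` and its methods are hypothetical names, and turning the collected strings into a GATE Gazetteer is deliberately left out.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical shared object: patientId -> category -> distinct entity strings.
class PatientEntityStore {
    private final Map<String, Map<String, Set<String>>> store = new ConcurrentHashMap<>();

    // Called during document processing; safe to call from several threads.
    void add(String patientId, String category, String text) {
        store.computeIfAbsent(patientId, p -> new ConcurrentHashMap<>())
             .computeIfAbsent(category, c -> ConcurrentHashMap.newKeySet())
             .add(text);
    }

    // Returns the category -> strings view for one patient (empty if unknown).
    Map<String, Set<String>> forPatient(String patientId) {
        return store.getOrDefault(patientId, Collections.emptyMap());
    }

    public static void main(String[] args) {
        PatientEntityStore entities = new PatientEntityStore();
        entities.add("patient1", "category1", "str1");
        entities.add("patient1", "category1", "str1"); // duplicate is stored once
        entities.add("patient1", "category2", "str2");
        System.out.println(entities.forPatient("patient1"));
    }
}
```

Sets are used instead of lists so that a string seen in several documents of the same patient ends up in the gazetteer only once; whether duplicates matter is one of the points still to be clarified.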