Initial conversion

ctmay4 commented 8 months ago

We need to move the logic from SEER*DMS to this library. A few things to consider:

We have our own implementation of Aho-Corasick in SEER*DMS, and that implementation was also copied to SEER*API. I think we should consider using an external library instead. https://github.com/robert-bor/aho-corasick looks like it would work.
This library will be used externally as well as in SEER*DMS. By default, it will use the keyword definitions bundled internally; that is the default usage pattern. When SEER*DMS uses it, we want to use the keyword definitions in the DMS database instead, so we need to design the library to allow that.

As far as design, the workflow will go as follows:

initialization

// create the builder
var builder = new ReportabilityScreenerBuilder();

// add internal keywords
builder.defaultKeywords();

// add external keywords one at a time
builder.add(keyword);

// or external keywords all at once
builder.add(keywords);

// finally call build
var screener = builder.build();
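
To make the "internal by default, external when needed" requirement concrete, here is a minimal sketch of how such a builder might hold its keywords. The class and method names follow the usage above, but everything else is an assumption: keywords are modelled as plain strings, the bundled list is replaced by two made-up placeholder entries, and the screener only exposes its keyword list rather than doing any matching.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: names follow the usage in this comment; the bodies are assumptions.
class ReportabilityScreenerBuilder {

    private final List<String> keywords = new ArrayList<>();

    // Stand-in for loading the keyword file bundled with the library.
    // The two entries below are made-up placeholders, not real keyword data.
    ReportabilityScreenerBuilder defaultKeywords() {
        keywords.add("placeholder positive keyword");
        keywords.add("placeholder negative keyword");
        return this;
    }

    // Add a single external keyword (e.g. one row from the DMS database).
    ReportabilityScreenerBuilder add(String keyword) {
        keywords.add(keyword);
        return this;
    }

    // Add a batch of external keywords at once.
    ReportabilityScreenerBuilder add(Collection<String> more) {
        keywords.addAll(more);
        return this;
    }

    ReportabilityScreener build() {
        return new ReportabilityScreener(keywords);
    }
}

class ReportabilityScreener {

    private final List<String> keywords;

    ReportabilityScreener(List<String> keywords) {
        // Defensive copy so later builder calls cannot change a built screener.
        this.keywords = Collections.unmodifiableList(new ArrayList<>(keywords));
    }

    List<String> keywords() {
        return keywords;
    }
}
```

With this shape, SEER*DMS would simply skip defaultKeywords() and call add(...) with the rows from its own database, while external users would call defaultKeywords() and nothing else.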

screen specific text

ScreeningResult result = screener.screen(text);

the result will contain the following:

public class ScreeningResult {
  public enum ReportabilityResult {
     REPORTABLE, NON_REPORTABLE, UNKNOWN
  }

  ReportabilityResult result;

  List<Keyword> positiveKeywords;
  List<Keyword> negativeKeywords;
  List<Keyword> otherKeywords;
}

Note that the Keyword class will include the keyword, the start/end position of the match, and an ignored flag.
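
For illustration only, one possible shape for that class is sketched below. The field names, the constructor, the exclusive end offset, and the matchedText helper are all assumptions made for the example, not decisions taken in this issue.

```java
// Hypothetical shape of a single match; field names and offsets are assumptions.
final class Keyword {

    private final String keyword;   // the keyword definition that matched
    private final int start;        // index of the first matched character
    private final int end;          // index just past the last matched character (assumed exclusive)
    private final boolean ignored;  // true if the match was found but should not count

    Keyword(String keyword, int start, int end, boolean ignored) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid match range: " + start + ".." + end);
        }
        this.keyword = keyword;
        this.start = start;
        this.end = end;
        this.ignored = ignored;
    }

    String getKeyword() { return keyword; }
    int getStart() { return start; }
    int getEnd() { return end; }
    boolean isIgnored() { return ignored; }

    // Returns the span of the screened text that this match covers.
    String matchedText(String text) {
        return text.substring(start, end);
    }
}
```

Keeping the offsets on the match lets a caller such as SEER*DMS highlight the hit in the original text without searching for it again.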

As for the keywords themselves, I ran this query:

select g.name as group, count(*)
from screening_keyword sk
   join screening_group g on (sk.keyword_group = g.group_id)
where reportability_alg_keyword = true
group by 1;

and got the following. Do we need to keep the "Other" group? Do they factor into reportability?

REG	group	count
SE	Other	378
SE	Negative	625
SE	Positive	1526

We will discuss in the meeting on Thursday.

ctmay4 commented 8 months ago

Adding @garybeverungen

garybeverungen commented 8 months ago

I made a first pass at implementing the keyword screening algorithm based on your notes. Let me know what you think.

FYI, I added a dependency for opencsv so I could use CSVReader to read the internal keyword file (just like in SEER*DMS). But there is a warning in the POM about it. Let me know if I need to remove that.

ctmay4 commented 8 months ago

I'll take a look tomorrow. Please create a pull request.

ctmay4 commented 8 months ago

Also, opencsv brings in a bunch of other commons dependencies. I've actually switched to fastcsv as my CSV library, but in this case I'd prefer to use no library at all. I think the easiest thing would be to switch the file to tab-separated, and then you can remove the quoting. At that point just split on tab and no library is needed.

ctmay4 commented 8 months ago

@depryf mentioned to me that he doesn't like tabs and suggested using a pipe instead. We just need to make sure none of the keywords contain pipes. I did a quick check and it appears that none do.
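
A rough sketch of what library-free parsing of a pipe-delimited file could look like, together with the "does any keyword contain a pipe" check. The class name, method names, and the two-column layout used in the usage note are assumptions for illustration; the actual file layout is not specified in this thread.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; names and column layout are assumptions.
class PipeKeywordFile {

    // '|' is a regex metacharacter, so it has to be quoted before splitting on it.
    private static final Pattern PIPE = Pattern.compile(Pattern.quote("|"));

    // Splits each non-blank line into its fields.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // A limit of -1 keeps trailing empty fields instead of dropping them.
            rows.add(PIPE.split(line, -1));
        }
        return rows;
    }

    // Sanity check to run before converting the file: a keyword that contains
    // the delimiter would be split into the wrong number of fields.
    static boolean anyContainsPipe(Collection<String> keywords) {
        return keywords.stream().anyMatch(k -> k.contains("|"));
    }
}
```

For example, assuming a hypothetical keyword|group layout, PipeKeywordFile.parse(List.of("some keyword|Positive")) yields one row with two fields. Note that a plain line.split("|") would not work here, because String.split treats its argument as a regular expression.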

imsweb / reportability-screener

Initial conversion #1