imsweb / reportability-screener

A keyword-based screening library for pathology report text

Initial conversion #1

Closed ctmay4 closed 8 months ago

ctmay4 commented 8 months ago

We need to move the logic from SEER*DMS to this library. A few things to consider:

As far as design the workflow will go

Note that the Keyword class will include the keyword, the start/end positions of the match, and whether the match was ignored.
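To make the shape of that class concrete, here is a minimal sketch of a match holder. The field names, the accessor names, the exclusive end offset, and the `findAll` helper (a plain case-sensitive substring search) are all assumptions for illustration; the actual matching rules in SEER*DMS may well differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and matching rules are assumptions, not the library's API.
final class Keyword {
    private final String keyword;  // the keyword text that matched
    private final int start;       // index of the first matched character (inclusive)
    private final int end;         // index just past the last matched character (exclusive)
    private final boolean ignored; // true if this match should not count toward the result

    Keyword(String keyword, int start, int end, boolean ignored) {
        this.keyword = keyword;
        this.start = start;
        this.end = end;
        this.ignored = ignored;
    }

    String getKeyword() { return keyword; }
    int getStart() { return start; }
    int getEnd() { return end; }
    boolean isIgnored() { return ignored; }

    // Finds every non-overlapping, case-sensitive occurrence of one keyword in the text.
    // Matches are created with ignored = false; deciding what to ignore is left out here.
    static List<Keyword> findAll(String text, String keyword) {
        List<Keyword> matches = new ArrayList<>();
        if (keyword.isEmpty()) {
            return matches; // an empty keyword would otherwise loop forever
        }
        int from = text.indexOf(keyword);
        while (from >= 0) {
            matches.add(new Keyword(keyword, from, from + keyword.length(), false));
            from = text.indexOf(keyword, from + keyword.length());
        }
        return matches;
    }
}
```

Keeping the offsets on the match lets a caller highlight or audit the exact span in the report text without searching again.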

As for the keywords themselves, I ran this query:

select g.name as group, count(*)
from screening_keyword sk
   join screening_group g on (sk.keyword_group = g.group_id)
where reportability_alg_keyword = true
group by 1;

and got the following. Do we need to keep the "Other" group? Do those keywords factor into reportability?

REG  group     count
SE   Other     378
SE   Negative  625
SE   Positive  1526

We will discuss in the meeting on Thursday.

ctmay4 commented 8 months ago

Adding @garybeverungen

garybeverungen commented 8 months ago

I made a first pass at implementing the keyword screening algorithm based on your notes. Let me know what you think.

FYI, I added a dependency on opencsv so I could use CSVReader to read the internal keyword file (just like in SEER*DMS), but there is a warning in the POM about it. Let me know if I need to remove the dependency.

ctmay4 commented 8 months ago

I'll take a look tomorrow. Please create a pull request.

ctmay4 commented 8 months ago

Also, opencsv brings in a bunch of other commons dependencies. I've actually switched to fastcsv as my CSV library, but in this case I'd prefer to use no library at all. I think the easiest thing would be to switch the file to tab-separated, which lets you remove the quoting. At that point you can just split on the tab character and no library is needed.

ctmay4 commented 8 months ago

@depryf mentioned to me that he doesn't like tabs and suggested using a pipe instead. We just need to make sure none of the keywords contain pipes. I did a quick check and it appears that none do.
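A rough sketch of what library-free parsing of a pipe-delimited file could look like follows. The class name `KeywordFileParser`, the two-column `keyword|group` layout in the usage example, and the choice to skip blank lines are assumptions for illustration, not the actual file format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: the class name and file layout are assumptions.
final class KeywordFileParser {

    // Splits one line on the pipe character. String.split takes a regular
    // expression and "|" means alternation there, so the delimiter is quoted.
    // The -1 limit keeps trailing empty fields instead of silently dropping them.
    static String[] splitLine(String line) {
        return line.split(Pattern.quote("|"), -1);
    }

    // Parses all lines, skipping blank ones. A wrong field count is rejected,
    // which also catches a keyword that itself contains a pipe.
    static List<String[]> parse(List<String> lines, int expectedFields) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = splitLine(line);
            if (fields.length != expectedFields) {
                throw new IllegalArgumentException(
                    "Expected " + expectedFields + " fields but found " + fields.length + ": " + line);
            }
            rows.add(fields);
        }
        return rows;
    }
}
```

For example, `KeywordFileParser.parse(lines, 2)` on the line `adenocarcinoma|Positive` would yield one row with the fields `adenocarcinoma` and `Positive`. The field-count check is a cheap guard in case a keyword containing a pipe is ever added to the file.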