
jenojp / negspacy

spaCy pipeline object for negating concepts in text

MIT License

274 stars, 36 forks

Termset choices #10

Closed snosrap closed 4 years ago

snosrap commented 4 years ago

Thanks for enabling negex in the spaCy ecosystem -- this is incredibly helpful.

I noticed your termsets.py file is a subset of the trigger words/phrases historically used by negex (see here).

Was this for performance issues? Or to make negspacy more generalizable in non-healthcare domains? Some other reason?

I'm aware you can override negspacy's default termsets (nice feature), so this is more of a general question.

Thanks again for making this available.

jenojp commented 4 years ago

Hey, thanks! That's a great question. Definitely not for performance reasons, but more for generalizability. Additionally, some terms are redundant for our purposes because of the way we're doing matching (e.g., "no" and "no mammographic evidence of" end up being duplicates as preceding negations). I certainly could go through and simplify the list a bit more.
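To illustrate the redundancy point, here is a minimal sketch of a toy preceding-negation check. It is not negspacy's actual matcher; the helper names `find_phrase` and `is_negated` and the sample sentence are assumptions made up for this example. It only shows that once "no" is a trigger, the longer phrase adds nothing.

```python
# Illustrative only: a toy preceding-negation check, NOT negspacy's matcher.
# find_phrase and is_negated are hypothetical helpers for this sketch.

def find_phrase(tokens, phrase):
    """Return the start indices where the phrase's tokens occur in tokens."""
    words = phrase.lower().split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def is_negated(sentence, entity, preceding):
    """True if any preceding trigger phrase ends before the entity starts."""
    tokens = sentence.lower().split()
    entity_starts = find_phrase(tokens, entity)
    if not entity_starts:
        return False
    entity_start = entity_starts[0]
    for phrase in preceding:
        length = len(phrase.split())
        for start in find_phrase(tokens, phrase):
            if start + length <= entity_start:
                return True
    return False


sentence = "There is no mammographic evidence of malignancy"
short_list = ["no"]
long_list = ["no", "no mammographic evidence of"]

# Both lists give the same answer, so the longer phrase is redundant here.
result_short = is_negated(sentence, "malignancy", short_list)
result_long = is_negated(sentence, "malignancy", long_list)
print(result_short, result_long)
```

Under this simplified matching rule the shorter trigger already covers every sentence the longer one would, which is the sense in which the two are duplicates.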

I use it for my work in a clinical domain so I tend to add a lot more hypothetical negations manually like:

# hypotheticals
"concern for",
"supposed",
"which causes",
"leads to",
"h/o",
"history of",
"symptoms atypical even for",
"without any reactions or signs of",
"instead of",
"if you experience",
"if you get",
"teaching the patient",
"taught the patient",
"teach the patient",
"educated the patient",
"educate the patient",
"educating the patient",
"monitored for",
"monitor for",
"test for",
"tested for"

I've been torn on whether to add these in as defaults or not. Perhaps the right way is to add a new term set language and call it "en_clinical"? If healthcare is your domain, I'd appreciate any feedback if I go that route.

snosrap commented 4 years ago

Hi Jeno, thanks that makes sense. I'm also in the clinical domain so I'd certainly be interested in a curated "en_clinical" language.

Some of the hypotheticals you listed, however, might be too sensitive/greedy for certain use cases (i.e., marking a term as negated when it shouldn't be). For example, "Pt has a history of diabetes" strikes me as a positive statement about diabetes. Similarly, "taught the patient about his new diabetes diagnosis" also seems like a positive statement.

If you add a new clinical language/termset, you'd probably want to make it clear whether it's for: 1) detecting clinical entities that have been negated, or 2) filtering clinical entities that are present/true but possibly irrelevant (which is obviously useful, but seems like a different use case).

Thanks again for making this tool.

jenojp commented 4 years ago

That's a great point. This particular use case was more of the second: we were tweaking the terms to avoid patient education and mentions that weren't from the present encounter. Patient education was a source of many false positives; it typically showed up as educating the patient about signs and symptoms to watch for after a procedure that would indicate a problem. So that's probably an important distinction to keep. If you don't mind, I'll keep this open and have you take a look when I get around to formalizing this.

jenojp commented 4 years ago

Update on this front: I ended up making three separate termsets that build upon each other. Since the primary use cases from users seem to be clinical in nature, I decided to make en_clinical the default to avoid breaking expected behavior with the 0.1.7 release. en_clinical is similar to en in previous versions, while en now has clinical-specific terminology removed. A third option, en_clinical_sensitive, adds terms that help rule out possibly irrelevant entities, if that is desired.

From readme:

Termsets

Designate termset to use, en_clinical is used by default.

negex = Negex(nlp, language="en_clinical")

en = phrases for general English-language text
en_clinical (default) = adds phrases specific to the clinical domain to general English
en_clinical_sensitive = adds further phrases to help rule out historical and possibly irrelevant entities
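As a rough picture of how three termsets can build upon each other, here is a minimal pure-Python sketch. It does not use negspacy; the phrase lists are a small hypothetical sample rather than the real termset contents, and `extend` and `flagged` are made-up helpers. It also shows the trade-off raised earlier in the thread: "history of" flags an entity only in the sensitive layer.

```python
# Illustrative only: phrase lists are a tiny hypothetical sample,
# NOT the actual contents of negspacy's termsets.
EN = {"preceding": ["no", "without", "not"]}
CLINICAL_EXTRA = {"preceding": ["denies", "negative for"]}
SENSITIVE_EXTRA = {"preceding": ["history of", "h/o", "monitor for"]}


def extend(base, extra):
    """Return a new termset: base phrases plus extra ones, per category."""
    merged = {category: list(phrases) for category, phrases in base.items()}
    for category, phrases in extra.items():
        bucket = merged.setdefault(category, [])
        for phrase in phrases:
            if phrase not in bucket:
                bucket.append(phrase)
    return merged


# Each layer is the previous layer plus its own additions.
TERMSETS = {"en": EN}
TERMSETS["en_clinical"] = extend(TERMSETS["en"], CLINICAL_EXTRA)
TERMSETS["en_clinical_sensitive"] = extend(TERMSETS["en_clinical"], SENSITIVE_EXTRA)


def flagged(sentence, entity, language="en_clinical"):
    """True if a preceding phrase from the chosen termset appears before the entity."""
    text = f" {sentence.lower()} "
    cut = text.find(f" {entity.lower()} ")
    if cut == -1:
        return False
    before = text[:cut + 1]
    return any(f" {p} " in before for p in TERMSETS[language]["preceding"])


sentence = "Pt has a history of diabetes"
print(flagged(sentence, "diabetes", "en_clinical"))            # False
print(flagged(sentence, "diabetes", "en_clinical_sensitive"))  # True
```

Keeping each layer as "previous layer plus additions" means a phrase added to the general list automatically reaches the clinical and sensitive lists, while the sensitive phrases never leak into the default.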

Commit f46471ebe60122e9e4d43b435a59610db5cb9948 and commit a095215d29fe0499e98f36597687f708948bc1a3