Closed snosrap closed 4 years ago
Hey thanks! That's a great question... Definitely not for performance reasons but more for generalizability. Additionally, some are repetitive for our purposes because of the way we're doing matching (e.g., "no" and "no mammographic evidence of" as preceding negations end up being duplicates). I certainly could go through and simplify the list a bit more.
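To make the duplicate point concrete, here is a minimal sketch in plain Python. It assumes a simplified matching model in which any trigger phrase appearing before an entity negates it; that is an assumption for illustration, not negspacy's actual matcher, and `redundant_triggers` is a hypothetical helper name. Under that model, a longer trigger that contains a shorter one as a run of whole tokens adds nothing.

```python
def redundant_triggers(phrases):
    """Map each phrase to the shorter phrases it contains as a contiguous token run.

    Assumption: a trigger fires whenever its tokens appear before the entity,
    so a longer phrase containing a shorter trigger is redundant.
    """
    tokenized = {p: p.lower().split() for p in phrases}
    redundant = {}
    for long_p, long_toks in tokenized.items():
        for short_p, short_toks in tokenized.items():
            if short_p == long_p or len(short_toks) >= len(long_toks):
                continue
            n = len(short_toks)
            # Compare whole tokens, so "no" does not match inside "not" or "without".
            if any(long_toks[i:i + n] == short_toks
                   for i in range(len(long_toks) - n + 1)):
                redundant.setdefault(long_p, []).append(short_p)
    return redundant


preceding = ["no", "no mammographic evidence of", "without", "denies"]
dupes = redundant_triggers(preceding)
print(dupes)  # {'no mammographic evidence of': ['no']}
```

The comparison is token-based on purpose: a substring check would wrongly flag "without" or "not" as containing "no".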
I use it for my work in a clinical domain so I tend to add a lot more hypothetical negations manually like:
# hypotheticals
"concern for",
"supposed",
"which causes",
"leads to",
"h/o",
"history of",
"symptoms atypical even for",
"without any reactions or signs of",
"instead of",
"if you experience",
"if you get",
"teaching the patient",
"taught the patient",
"teach the patient",
"educated the patient",
"educate the patient",
"educating the patient",
"monitored for",
"monitor for",
"test for",
"tested for"
I've been torn on whether to add these in as defaults or not. Perhaps the right way is to add a new term set language and call it "en_clinical"? If healthcare is your domain, I'd appreciate any feedback if I go that route.
Hi Jeno, thanks, that makes sense. I'm also in the clinical domain, so I'd certainly be interested in a curated "en_clinical" language.
Some of the hypotheticals you listed, however, might be too sensitive/greedy for certain use cases (i.e., marking a term as negated when it shouldn't be). For example, "Pt has a history of diabetes" strikes me as a positive statement about diabetes. Similarly, "taught the patient about his new diabetes diagnosis" also seems like a positive statement.
If you add a new clinical language/termset, you'd probably want to make it clear whether it's for: 1) detecting clinical entities that have been negated, or 2) filtering clinical entities that are present/true but possibly irrelevant (which is obviously useful, but seems like a different use case).
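A rough sketch of how the two use cases could be kept apart is below. It is plain Python rather than negspacy code; the trigger lists, the label strings, and the `classify` function are all hypothetical, chosen only to illustrate the distinction, and the whitespace tokenization ignores punctuation.

```python
# Hypothetical trigger lists, for illustration only.
NEGATION_TRIGGERS = ["no", "denies", "without"]
CONTEXT_TRIGGERS = ["history of", "h/o", "taught the patient", "monitor for"]


def contains_phrase(tokens, phrase):
    """True if the phrase's tokens appear as a contiguous run in tokens."""
    p = phrase.split()
    return any(tokens[i:i + len(p)] == p
               for i in range(len(tokens) - len(p) + 1))


def classify(sentence, entity):
    """Label an entity using only the text that precedes it in the sentence."""
    text = sentence.lower()
    idx = text.find(entity.lower())
    if idx == -1:
        raise ValueError("entity not found in sentence")
    preceding = text[:idx].split()
    if any(contains_phrase(preceding, t) for t in NEGATION_TRIGGERS):
        return "negated"
    if any(contains_phrase(preceding, t) for t in CONTEXT_TRIGGERS):
        return "present_but_possibly_irrelevant"
    return "affirmed"


print(classify("Pt denies chest pain", "chest pain"))
print(classify("Pt has a history of diabetes", "diabetes"))
```

With separate labels, "history of diabetes" is no longer reported as negated; a caller can decide whether to drop or keep entities in the second bucket.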
Thanks again for making this tool.
That's a great point. This particular use case was more of the second kind: we were tweaking the terms to filter out patient education, which was a source of many false positives (typically text educating the patient about signs and symptoms to watch for after a procedure that would indicate a problem), and mentions that weren't from the present encounter. So that's probably an important distinction to keep. If you don't mind, I'll keep this open and have you take a look when I get around to formalizing this.
Update on this front: I ended up making three separate termsets that build upon each other. Since it seems the primary use cases from users are clinical in nature, I decided to make en_clinical the default to avoid breaking expected behavior with the 0.1.7 release. en_clinical is similar to en in previous versions, while en has clinical-specific terminology removed. A third option, en_clinical_sensitive, adds terms that help rule out possibly irrelevant entities if that is desired.
From readme:
Designate termset to use, en_clinical is used by default.
negex = Negex(nlp, language = "en_clinical")
en = phrases for general english language text
en_clinical DEFAULT = adds phrases specific to clinical domain to general english
en_clinical_sensitive = adds additional phrases to help rule out historical and possibly irrelevant entities
Commit f46471ebe60122e9e4d43b435a59610db5cb9948 and commit a095215d29fe0499e98f36597687f708948bc1a3
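One way to picture termsets that build upon each other is sketched below. This is not the contents of termsets.py: the dictionary layout, the `preceding_negations` key, the `extend` and `get_termset` helpers, and the specific terms in each layer are all assumptions made for illustration.

```python
# Illustrative placeholder terms; the real lists live in negspacy's termsets.py.
EN = {"preceding_negations": ["no", "not", "without"]}
CLINICAL_ADDITIONS = {"preceding_negations": ["denies", "no mammographic evidence of"]}
SENSITIVE_ADDITIONS = {"preceding_negations": ["history of", "h/o", "monitor for"]}


def extend(base, additions):
    """Return a new termset: base plus additions, without duplicates or mutation."""
    merged = {key: list(terms) for key, terms in base.items()}
    for key, terms in additions.items():
        bucket = merged.setdefault(key, [])
        for term in terms:
            if term not in bucket:
                bucket.append(term)
    return merged


EN_CLINICAL = extend(EN, CLINICAL_ADDITIONS)
EN_CLINICAL_SENSITIVE = extend(EN_CLINICAL, SENSITIVE_ADDITIONS)

TERMSETS = {
    "en": EN,
    "en_clinical": EN_CLINICAL,
    "en_clinical_sensitive": EN_CLINICAL_SENSITIVE,
}


def get_termset(language="en_clinical"):
    """Look up a termset by name; the clinical set is the default."""
    return TERMSETS[language]


print(get_termset()["preceding_negations"])
```

Each layer is a strict superset of the one before it, so switching from en to en_clinical_sensitive can only add triggers, never remove them.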
Thanks for enabling negex in the spaCy ecosystem -- this is incredibly helpful.
I noticed your termsets.py file is a subset of the trigger words/phrases historically used by negex (see here).
Was this for performance reasons? Or to make negspacy more generalizable in non-healthcare domains? Some other reason?
I'm aware you can override negspacy's default termsets (nice feature), so this is more of a general question.
Thanks again for making this available.