Open: comorbidity opened this issue 1 year ago
@mikix ideally, you can more simply turn on the pipeline using this "piper" file with the right settings https://github.com/text2phenotype/ctakes/tree/main/src/main/resources/com/text2phenotype/ctakes/resources/smoking_status
OK initial thoughts:
Seems like cTAKES's smoking status support is not a turnkey "flip this switch" kind of thing, but a bunch of pieces that you put together yourself. And the text2phenotype project has done that assembly for us.
I don't have a very good understanding of what text2phenotype is doing here, but it seems raw/pure cTAKES is not easy to pull together ourselves, and we might want to leverage that work.
(Update Oct 2023: I tried to build their docker image. After some tweaks to get things going, like switching to the right JDK version (`9-jdk8-corretto`) and changing how resources get included (commenting out `INCLUDE_RES=true` and copying in some files from the source tree)... I was finally able to get something that seemed to run without errors. But I could not figure out how to query it via REST API. I know that sounds silly, but no endpoints seemed to be exposed at all. My next step was to either try to get Tomcat to print registered endpoints and debug it at the Tomcat level, OR take the smoking code from `text2phenotype` and put it into an upstream cTAKES checkout, which I do have working for Cumulus. But both involve potentially-deep Java coding and I'm focusing on other things right now.)
Our current cTAKES image is set up as a symptoms extractor by default. We could use a similar override method to replace its pipeline, like we do for the symptoms dictionary. But we would also need to inject a bunch of built Java into the image, which we could also hook up.
But... maybe we just build the `text2phenotype` docker image, throw it up, and use that, meaning we would now have two cTAKES images we are building, but for different use cases. Maybe wise, maybe not.
The other interesting part of that is how Cumulus manages multiple dependent services. Ideally it would be able to spin them up and down as needed. But since docker compose doesn't really work like that, we could outsource that to the user and switch our current paradigm from one global `etl-support` profile to study-specific profiles, and the user would start up what they know they need. So `docker compose up --profile covid_symptom`, for example. (Update Oct 2023: this was done elsewhere, when we integrated the `termexists` cNLP transformer.)
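A study-specific profile setup along those lines could be sketched in compose config roughly as follows (a minimal illustration only; the service names, image names, and a `smoking_status` profile are hypothetical, not the actual Cumulus compose file):

```yaml
# Hypothetical docker-compose.yaml excerpt: each NLP service is tagged
# with the study profiles that need it, so users only start what they use.
services:
  ctakes-covid:
    image: example/ctakes-covid:latest    # illustrative image name
    profiles: [covid_symptom]
  ctakes-smoking:
    image: example/ctakes-smoking:latest  # illustrative image name
    profiles: [smoking_status]
```

With that shape, `docker compose up --profile covid_symptom` would start only the first service and leave the rest down.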
`text2phenotype` also has a Go version that might be worth exploring, as it would be faster and is apparently battle-tested for their use cases.
Update Oct 2023: The Go code is designed for a very text2phenotype-specific workflow. It reads notes in via RabbitMQ and drops the results into an S3 bucket, whereas Cumulus expects to talk to NLP over a REST API. So that would be some work to get going.
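To give a sense of the adapter work involved: since Cumulus speaks REST, the Go service's queue-in/S3-out flow would need a small HTTP shim in front of it. Here's a minimal sketch of the shape of such a shim (all names and the endpoint path are hypothetical; the stub classifier stands in for publishing to RabbitMQ and polling the S3 output):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify_smoking_status(note_text: str) -> str:
    # Stand-in for the real backend: an actual shim would hand the note
    # to the Go pipeline (via RabbitMQ) and wait for its S3 output.
    return "SMOKER" if "smok" in note_text.lower() else "UNKNOWN"

class NlpHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Accept a raw note body on a single hypothetical endpoint.
        if self.path != "/process":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        note = self.rfile.read(length).decode("utf-8")
        body = json.dumps({"smoking_status": classify_smoking_status(note)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging; not needed for a sketch.
        pass

def serve(port: int = 0) -> HTTPServer:
    # Port 0 asks the OS for any free port.
    return HTTPServer(("127.0.0.1", port), NlpHandler)
```

The point is less the code than the architecture: the ETL keeps its simple request/response model, and all the queue/bucket plumbing stays hidden behind the shim.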
Tim also suggested that we could just build a new cNLP BERT model for smoking status, and skip cTAKES. We suspect that would have better performance and Tim doesn't think it's hard. That sounds tempting...
> But since docker compose doesn't really work like that, we could outsource that to the user and switch our current paradigm from one global `etl-support` profile to study-specific profiles and the user would start up what they know they need. So `docker compose up --profile covid_symptom` for example.
That is fine, though the space might get big at some point. We're edging up on "we should use a proper container orchestration platform" territory: with Swarm or Kubernetes we could spin up resources as needed, if we're clever about it.
@tmills believes BERT would beat any of the existing SmokingStatus pipelines, which makes sense, because the old SmokingStatus pipelines are very old... like 10-15 years old. Even the Go version is a literally-exact translation of the old algorithm, just in Go, which is much faster.
So the tradeoff is faster Go vs more-accurate BERT?
Go path is something like "build the docker image, throw it up in our docker hub, and reference it from Cumulus's compose file"
BERT path is "go train a cNLP BERT model like we did for negation, and do the same docker hub dance"
In either case, there'd be some integration work in ETL land to call the right service, but that would be a similar amount of effort for both.
@comorbidity is there value in doing both? Like, part of looking at smoking status was comparing approaches, right? Or do you think BERT is just going to be so obviously better that we don't super care about the performance?
@mikix great question and discussion. The BERT-based model should be expected to run almost identically to the current "negation" pipeline. What was the speed of cNLP negation compared to cTAKES for the symptoms dictionary?
> What was the speed of cNLP negation compared to cTAKES for the symptoms dictionary?
I do not remember numbers, but I recall cTAKES being faster. I could get numbers if that would guide our discussion.
Another advantage of the BERT approach is reusability: there is nothing special about this smoking model. From the BERT perspective, it's just words and the labels "smoker", "non-smoker", etc.
Therefore: long-term time invested in a BERT model (using cNLP) would pay off better, because we could reuse the approach for any number of tasks that we need a model for.
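To illustrate the "just words and labels" point: under this framing, a smoking-status classifier and any other note classifier differ only in their label set and trained weights. A minimal sketch (the keyword-counting "scorer" here is a stand-in for a fine-tuned BERT model; all names are illustrative):

```python
from typing import Callable

def make_classifier(labels: list[str],
                    scorer: Callable[[str, str], float]) -> Callable[[str], str]:
    # A "task" is just a label set plus a scoring function. For a real
    # system the scorer would be a fine-tuned BERT model, not keyword rules.
    def classify(text: str) -> str:
        return max(labels, key=lambda label: scorer(text, label))
    return classify

# Stand-in scorer: counts crude keyword evidence for each label.
KEYWORDS = {
    "smoker": ["smokes", "tobacco", "pack"],
    "non-smoker": ["denies smoking", "never smoked"],
    "unknown": [],
}

def keyword_score(text: str, label: str) -> float:
    return sum(text.lower().count(kw) for kw in KEYWORDS[label])

# Swapping in a different label set (and scorer) yields a different task,
# with no change to the surrounding pipeline.
smoking = make_classifier(["smoker", "non-smoker", "unknown"], keyword_score)
```

The reuse argument is exactly this shape: the ETL integration, serving, and training harness stay fixed, and only the labels and training data change per task.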
Minor note: I updated my comment above with the results of some investigations. I'm putting this down for now to focus on other priorities, but may come back to this.
Official cTAKES smoking status page: https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+-+Smoking+status#cTAKES4.0Smokingstatus-OverviewofSmokingstatus. This reference implementation may be helpful.