smart-on-fhir / cumulus-etl

Extract FHIR data, Transform with NLP and DEID tools, and then Load FHIR data into a SQL Database for analysis
https://docs.smarthealthit.org/cumulus/etl

Enable SmokingStatus via cTAKES #273

Open comorbidity opened 1 year ago

comorbidity commented 1 year ago

Official cTAKES page https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+-+Smoking+status#cTAKES4.0Smokingstatus-OverviewofSmokingstatus

This reference implementation may be helpful

comorbidity commented 1 year ago

@mikix ideally, you can more simply turn on the pipeline using this "piper" file with the right settings https://github.com/text2phenotype/ctakes/tree/main/src/main/resources/com/text2phenotype/ctakes/resources/smoking_status

comorbidity commented 1 year ago

https://cwiki.apache.org/confluence/display/CTAKES/Piper+Files

mikix commented 1 year ago

OK initial thoughts:

Seems like cTAKES's smoking status support is not a turnkey "flip this switch" kind of thing, but a bunch of pieces that you put together.

text2phenotype

And the text2phenotype project has done that assembly for us.

I don't have a very good understanding of what text2phenotype is doing here, but it seems raw/pure cTAKES is not easy to pull together ourselves, and we might want to leverage that work.

Update Oct 2023: I tried to build their docker image. After some tweaks to get things going, like switching to the right jdk version (9-jdk8-corretto) and changing how resources get included (commenting out INCLUDE_RES=true and copying in some files from the source tree), I was finally able to get something that seemed to run without errors. But I could not figure out how to query it via REST API. I know that sounds silly, but no endpoints seemed to be exposed at all. My next step was to either get Tomcat to print its registered endpoints and debug it at the Tomcat level, OR take the smoking code from text2phenotype and put it into an upstream cTAKES checkout, which I do have working for Cumulus. But both involve potentially-deep Java coding and I'm focusing on other things right now.
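To make the "no endpoints seemed to be exposed" problem concrete, here's a minimal sketch of the kind of probing involved, assuming the container publishes port 8080. The endpoint paths are guesses (which was exactly the problem):

```python
import requests

# Hypothetical endpoint paths -- the actual paths (if any) were never discovered.
CANDIDATE_PATHS = [
    "/ctakes-web-rest/service/analyze",  # the path upstream cTAKES REST images use
    "/smoking_status",                   # a guess based on the piper file name
]

note_text = "Patient reports quitting smoking two years ago."

for path in CANDIDATE_PATHS:
    url = f"http://localhost:8080{path}"
    try:
        response = requests.post(url, data=note_text, timeout=30)
        print(path, response.status_code, response.text[:200])
    except requests.ConnectionError as exc:
        print(path, "connection failed:", exc)
```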

Integrating into Cumulus

Our current cTAKES image is set up as a symptoms extractor by default. We could use a similar override method to replace its pipeline, like we do for the symptoms dictionary. But we would also need to inject a bunch of built Java in there, which we could also hook up.

But... maybe we just build the text2phenotype docker image, throw it up and use that - meaning we now have two cTAKES images we are building, but for different use cases. Maybe wise, maybe not.

The other interesting part of that is how Cumulus manages multiple dependent services. Ideally it would be able to spin them up and down as needed. But since docker compose doesn't really work like that, we could outsource that to the user and switch our current paradigm from one global etl-support profile to study-specific profiles, and the user would start up what they know they need. So docker compose up --profile covid_symptom, for example. (Update Oct 2023: this was done elsewhere when we integrated the termexists cNLP transformer.)
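For illustration, a minimal sketch of what study-specific profiles could look like in a compose file; the service names and images here are hypothetical, not our actual compose file:

```yaml
# Hypothetical excerpt -- service names and images are illustrative only.
services:
  ctakes-covid:
    image: smartonfhir/ctakes-covid:latest
    profiles: ["covid_symptom"]

  ctakes-smoking:
    image: example/ctakes-smoking:latest   # the text2phenotype-based image, if built
    profiles: ["smoking_status"]
```

With profiles like these, compose only starts the services tagged for the profile the user requests, and untagged services still start by default.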

Go or Java

text2phenotype also has a Go version that might be worth exploring, as it would be faster and is apparently battle-tested for their use cases.

Update Oct 2023: The Go code is designed around a very text2phenotype-specific workflow: it reads notes in via rabbitmq and drops the results into an S3 bucket, whereas Cumulus expects to talk to NLP over a REST API. So that would take some work to get going.

cNLP Transformers

Tim also suggested that we could just build a new cNLP BERT model for smoking status and skip cTAKES entirely. We suspect that would have better accuracy, and Tim doesn't think it would be hard. That sounds tempting...
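For a sense of scale, a minimal sketch of what calling such a model could look like, using the Hugging Face transformers pipeline API. The model name and labels are hypothetical; no trained model exists yet:

```python
from transformers import pipeline

# Hypothetical model name -- a cNLP-style BERT fine-tuned on smoking status.
classifier = pipeline(
    "text-classification",
    model="example-org/smoking-status-bert",
)

note = "Patient is a former smoker, quit 10 years ago, denies current tobacco use."
result = classifier(note)
print(result)  # e.g. [{"label": "PAST_SMOKER", "score": 0.97}]
```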

dogversioning commented 1 year ago

But since docker compose doesn't really work like that, we could outsource that to the user and switch our current paradigm from one global etl-support profile to study-specific profiles, and the user would start up what they know they need. So docker compose up --profile covid_symptom, for example.

That is fine, though the space might get big at some point. We're edging up on the 'we should use a proper container orchestration platform' territory - we could spin up resources as needed if we're clever about it with swarm/k8s.

comorbidity commented 1 year ago

@tmills believes BERT would beat the SOTA of any of the existing SmokingStatus pipelines, which makes sense, because the old SmokingStatus pipelines are very old... like 10-15 years old. Even the Go version is a literally-exact translation of the original algorithm, just in Go, which is much faster.

mikix commented 1 year ago

So the tradeoff is faster Go vs more-accurate BERT?

The Go path is something like "build the docker image, push it up to our docker hub, and reference it from Cumulus's compose file".

The BERT path is "train a cNLP BERT model like we did for negation, and then do the same docker hub dance".

In either case, there'd be some integration work in ETL land to call the right service, but that would be a similar amount of effort for both.

mikix commented 1 year ago

@comorbidity is there value in doing both? Like, part of looking at smoking status was comparing approaches, right? Or do you think BERT is just going to be so obviously better that we don't super care about the performance?

comorbidity commented 1 year ago

@mikix Great question and discussion. The BERT-based model should be expected to run almost identically to the current "negation" pipeline. What was the speed of cNLP negation compared to cTAKES for the symptoms dictionary?

mikix commented 1 year ago

What was the speed of cNLP negation compared to cTAKES for the symptoms dictionary?

I do not remember numbers, but I recall cTAKES being faster. I could get numbers if that would guide our discussion.

comorbidity commented 1 year ago

Another advantage of the BERT approach is reusability: there is nothing special about this smoking model -- from the BERT perspective it's just words and the labels "smoker", "non-smoker", etc.

Therefore, long-term, time invested in a BERT model (via cNLP) would be the better bet, because we could reuse the same approach for any number of tasks that we need a model for.
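A minimal sketch of that point, assuming the Hugging Face transformers API: the label list is the only task-specific piece, so the same setup would work for any classification task we care about:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The label set is the only task-specific part; swap it out for any other task.
labels = ["CURRENT_SMOKER", "PAST_SMOKER", "NON_SMOKER", "UNKNOWN"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, fine-tuning proceeds exactly as for any other classification task
# (e.g. the negation model mentioned above), just with different training data.
```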

mikix commented 11 months ago

Minor note: I updated my comment above with the results of some investigations. I'm putting this down for now to focus on other priorities, but may come back to this.