ireneisdoomed closed this issue 2 years ago
At the moment I am stuck at task #1 because I cannot reproduce the environment needed to load the model successfully. This is the error I am getting:
ModuleNotFoundError: No module named 'transformers.modeling_bert'
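This error is consistent with code written for transformers 3.x running under 4.x, where the BERT code lives at transformers.models.bert.modeling_bert instead. A minimal sketch of a fallback import follows. The helper name import_first_available is hypothetical, and pinning transformers to the version the model was trained with may well be the simpler fix.

```python
import importlib


def import_first_available(*module_paths):
    """Return the first module in module_paths that can be imported."""
    for path in module_paths:
        try:
            return importlib.import_module(path)
        except ModuleNotFoundError:
            # Try the next candidate location
            continue
    raise ModuleNotFoundError(
        f"None of these modules could be imported: {module_paths}"
    )


# Intended use (assumes transformers is installed):
# modeling_bert = import_first_available(
#     "transformers.modeling_bert",              # layout before 4.0
#     "transformers.models.bert.modeling_bert",  # layout from 4.0 onwards
# )
```

This only helps for explicit imports in the scripts. If the old module path is stored inside a pickled model, the shim will not be picked up and the version pin is probably needed.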
Keep in mind there are two levels of categories. For example, the classified stop reason Insufficient enrollment is also catalogued as a Neutral reason, as the problem is considered to be independent of the study design.
The high-level classes are the result of a one-to-one mapping from the low-level classes.
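As a rough illustration of the two levels, the mapping can be thought of as a plain lookup table. This is only a sketch: the single entry below is the one pair stated in this thread, while the names SUBCLASS_TO_SUPERCLASS, to_superclass, with_superclasses and the "Unmapped" fallback are hypothetical and not taken from common_classes.py.

```python
# Only the Insufficient_Enrollment -> Neutral pair comes from this thread.
# The remaining pairs would have to be filled in from the real mapping.
SUBCLASS_TO_SUPERCLASS = {
    "Insufficient_Enrollment": "Neutral",
}


def to_superclass(subclass, default="Unmapped"):
    """Look up the high-level class of a low-level class.

    The "Unmapped" default is a hypothetical placeholder for unknown classes.
    """
    return SUBCLASS_TO_SUPERCLASS.get(subclass, default)


def with_superclasses(subclasses):
    """Pair each predicted low-level class with its high-level class."""
    return [(subclass, to_superclass(subclass)) for subclass in subclasses]
```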
I committed some changes to the common_classes.py and predict.py scripts. The common_files now has the mapping-to-superclass function that is called when outputting the data. The hardcoded path names are now parametrised, so the script should be callable as:
python3 predict.py input_file output_stopped_file output_nonstopped_file
The commit comment contains more details on the changes.
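For anyone reading without the repository at hand, three positional paths like these could be parsed as sketched below. This is an assumption about the shape of the interface, not a copy of predict.py: the parse_args helper and the help texts are illustrative only.

```python
import argparse


def parse_args(argv=None):
    """Parse the three positional paths (hypothetical helper, not from predict.py)."""
    parser = argparse.ArgumentParser(
        description="Predict stop reason categories for clinical trials."
    )
    # The help texts describe an assumed meaning of each path
    parser.add_argument("input_file", help="Input file with the studies to classify")
    parser.add_argument("output_stopped_file", help="Output path for the stopped studies file")
    parser.add_argument("output_nonstopped_file", help="Output path for the non-stopped studies file")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.input_file, args.output_stopped_file, args.output_nonstopped_file)
```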
The enrichment of the categorisation for the reasons why a clinical trial ends on the evidence of ChEMBL 22.02 is as follows:
This is a breakdown of how the different subclasses are distributed:
Negative and Safety_Sideeffects.
Possibly_Negative, I don't think these should be relevant for the scoring, as they are indeed related to administrative reasons.
This is what I followed to extract these results. On a separate branch I am working on the prediction part in the ChEMBL.py module. This is still WIP and I will track the discussion in another ticket.
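from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

evd = spark.read.json('data/cttv008-20-01-2022.json.gz')
studies = (
    evd
    # Extract studies with their reasons to stop
    .filter(F.col('studyStopReason').isNotNull())
    # One row per URL, keeping only the ClinicalTrials link
    .withColumn('urls', F.explode('urls'))
    .filter(F.col('urls.niceName').contains('ClinicalTrials'))
    # The NCT ID is the second-to-last chunk after splitting the URL on %22
    .withColumn('nct_id', F.element_at(F.split(F.col('urls.url'), '%22'), -2))
    .select('nct_id', F.col('studyStopReason').alias('why_stopped'))
    .distinct()
)
# Spark writes a directory at this path; coalesce(1) keeps it to a single part file
studies.coalesce(1).write.csv('data/studies.tsv', sep='\t', header=True)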
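To make the split on %22 in the snippet above easier to follow, here is the same extraction in plain Python. The example URL shape is an assumption about how the ClinicalTrials links look in the evidence, and extract_nct_id is a hypothetical helper used only for illustration.

```python
def extract_nct_id(url):
    """Return the second-to-last chunk of the URL split on '%22', or None."""
    parts = url.split("%22")
    # Mirrors element_at(..., -2): no value when there are fewer than two chunks
    return parts[-2] if len(parts) >= 2 else None


# Assumed URL shape: the NCT ID wrapped in URL-encoded double quotes (%22)
example_url = "https://clinicaltrials.gov/search?id=%22NCT05075902%22"
print(extract_nct_id(example_url))
```

Splitting the example URL gives three chunks (the prefix, the ID and an empty string), which is why the second-to-last element is the one that holds the ID.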
By calling the predict.py module written by Olesya and edited by me.
Code and instructions on how to run it can be found here: https://github.com/ireneisdoomed/stopReasons
This script loads the model, instantiates the BERT Tokenizer and classifier and returns the predictions in a TSV file.
Build the output from step 2 onto the latest ChEMBL submission.
This is handled by the new module ChEMBL.py
Final evidence schema:
root
 |-- clinicalPhase: long (nullable = true)
 |-- clinicalStatus: string (nullable = true)
 |-- datasourceId: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- diseaseFromSource: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- drugId: string (nullable = true)
 |-- studyStartDate: string (nullable = true)
 |-- studyStopReason: string (nullable = true)
 |-- targetFromSource: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- urls: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- niceName: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- studyStopReasonCategories: array (nullable = true) <--- New field
 |    |-- element: string (containsNull = true)
As a developer I want to categorise the reasons why a Clinical Trial has stopped in order to analyse ChEMBL evidence in a new dimension and score the evidence accordingly.
Background
This ticket belongs to the epic where we want to integrate @LesyaR's NLP model into the Platform's pipelines.
The NLP model can be downloaded here:
gs://ot-team/olesya/bert_trials
Olesya has put together some scripts to run predictions:
Tasks
The idea is to add a new column called studyStopReasonCategory that is the result of applying the model on studyStopReason. We want to achieve this in a computationally efficient way.
predictions.py by importing ChEMBL's data in a Pandas DF. Evaluate performance.
create_predictions method in a UDF. Evaluate performance.
Acceptance tests
How do we know the task is complete?
NCT05075902, reason to stop is "COVID19" and the assigned category is Covid19.
NCT04958967, reason to stop is "The protocol need to review" and the assigned category is Study_Design.
NCT05067322, reason to stop is "Low enrollment at the site." and the assigned category is Insufficient_Enrollment.
NCT05048511, reason to stop is "The project was abandoned because of a lot of publications on the subject in the meantime and it was not considered relevant to continue." and the assigned category is Another_Study.
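The four cases above could be checked automatically once the predictions are available. The sketch below assumes the predictions have been collected into a dict from NCT ID to a list of assigned categories. That structure and the failed_acceptance_checks helper are hypothetical, while the expected values are the ones listed in this ticket.

```python
# Expected category per study, as listed in the acceptance tests
EXPECTED = {
    "NCT05075902": "Covid19",
    "NCT04958967": "Study_Design",
    "NCT05067322": "Insufficient_Enrollment",
    "NCT05048511": "Another_Study",
}


def failed_acceptance_checks(predictions):
    """Return (nct_id, expected, got) for every study missing its expected category.

    predictions is assumed to map an NCT ID to the list of assigned categories.
    """
    failures = []
    for nct_id, expected in EXPECTED.items():
        got = predictions.get(nct_id, [])
        if expected not in got:
            failures.append((nct_id, expected, got))
    return failures
```

An empty list means all four acceptance cases pass. Anything else shows which study got an unexpected category.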