ireneisdoomed commented 2 years ago

As a developer I want to categorise the reasons why a Clinical Trial has stopped in order to analyse ChEMBL evidence in a new dimension and score the evidence accordingly.

Background

This ticket belongs to the epic where we want to integrate @LesyaR's NLP model onto the Platform's pipelines.

The NLP model can be downloaded here: gs://ot-team/olesya/bert_trials. Olesya has put together some scripts to run predictions:

common_classes.py: boilerplate code to initialise the BERT model
predictions.py: code to run prediction on a Pandas DataFrame

Tasks

The idea is to add a new column called studyStopReasonCategory that is a result of applying the model on studyStopReason. We want to achieve this in a computationally efficient way.

[x] Run predictions.py by importing ChEMBL's data in a Pandas DF. Evaluate performance.
[ ] Move the logic to PySpark. The most straightforward way is to encapsulate the create_predictions method in a UDF. Evaluate performance.
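
A minimal sketch of the UDF approach, not the actual implementation. It assumes the model call can be reduced to a function that takes a list of strings and returns one list of category labels per string. `classify`, `categorise_batch` and `make_stop_reason_udf` are hypothetical names; the real signature of create_predictions is not shown in this ticket. A vectorised (pandas) UDF is used so that the model sees a batch of rows per call rather than one row at a time.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-in for the model call: texts in, one list of labels per text out.
Classifier = Callable[[List[str]], List[List[str]]]


def categorise_batch(
    reasons: Sequence[Optional[str]], classify: Classifier
) -> List[Optional[List[str]]]:
    """Classify only the non-empty stop reasons; keep None for the rest."""
    reasons = list(reasons)
    # Positions that actually hold text; nulls and blank strings skip the model.
    idx = [i for i, r in enumerate(reasons) if isinstance(r, str) and r.strip()]
    preds = classify([reasons[i] for i in idx]) if idx else []
    out: List[Optional[List[str]]] = [None] * len(reasons)
    for i, labels in zip(idx, preds):
        out[i] = labels
    return out


def make_stop_reason_udf(classify: Classifier):
    """Wrap categorise_batch in a pandas UDF returning array<string>.

    pandas and PySpark are imported here so that the batching logic above
    can be tested without a Spark installation.
    """
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("array<string>")
    def stop_reason_category(reasons: pd.Series) -> pd.Series:
        return pd.Series(categorise_batch(reasons.tolist(), classify))

    return stop_reason_category
```

It could then be applied with something like `evd.withColumn('studyStopReasonCategory', make_stop_reason_udf(classify)(evd['studyStopReason']))`. How the model is loaded on the executors is left open here and will probably dominate performance.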

Acceptance tests

How do we know the task is complete?

When I check the CT NCT05075902, reason to stop is COVID19 and the assigned category is Covid19.
When I check the CT NCT04958967, reason to stop is The protocol need to review and the assigned category is Study_Design.
When I check the CT NCT05067322, reason to stop is Low enrollment at the site. and the assigned category is Insufficient_Enrollment.
When I check the CT NCT05048511, reason to stop is The project was abandoned because of a lot of publications on the subject in the meantime and it was not considered relevant to continue. and the assigned category is Another_Study.
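
These four checks could be automated along the following lines. This is only a sketch: it assumes the predictions are available as a dictionary from NCT ID to the list of assigned categories, and `failed_acceptance_checks` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Trial ID -> expected category, copied from the acceptance tests above.
EXPECTED_CATEGORY: Dict[str, str] = {
    "NCT05075902": "Covid19",
    "NCT04958967": "Study_Design",
    "NCT05067322": "Insufficient_Enrollment",
    "NCT05048511": "Another_Study",
}


def failed_acceptance_checks(
    predicted: Dict[str, List[str]]
) -> List[Tuple[str, str, List[str]]]:
    """Return (nct_id, expected, got) for every trial missing its expected category."""
    failures = []
    for nct_id, expected in EXPECTED_CATEGORY.items():
        got = predicted.get(nct_id, [])
        if expected not in got:
            failures.append((nct_id, expected, got))
    return failures
```

An empty list means all four acceptance tests pass.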

ireneisdoomed commented 2 years ago

At the moment I am stuck at task #1 because I cannot reproduce the environment needed to load the model. This is the error I am getting: ModuleNotFoundError: No module named 'transformers.modeling_bert'
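
One possible cause, stated as an assumption because the environment is not described here: transformers 4.x moved the per-model modules under transformers.models, so transformers.modeling_bert only exists in older releases. If the model was serialised against the old layout, pinning a pre-4.0 release of transformers may be the simpler fix. Where the code only needs to import the module, a fallback like the one below could help. `import_first_available` is a hypothetical helper.

```python
import importlib
from types import ModuleType
from typing import Sequence

# Candidate locations of the BERT modelling module, newest layout first.
BERT_MODULE_CANDIDATES = (
    "transformers.models.bert.modeling_bert",  # transformers 4.x layout
    "transformers.modeling_bert",              # layout before 4.0
)


def import_first_available(candidates: Sequence[str]) -> ModuleType:
    """Import and return the first module in `candidates` that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:  # ModuleNotFoundError is a subclass of ImportError
            continue
    raise ImportError(f"None of these modules could be imported: {list(candidates)}")


# Intended use (requires transformers to be installed):
# modeling_bert = import_first_available(BERT_MODULE_CANDIDATES)
```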

d0choa commented 2 years ago

Keep in mind there are 2 levels of categories. For example, the classified stop reason Insufficient enrollment is also catalogued as a Neutral reason, as the problem is considered to be independent of the study design.

The high-level classes are the result of a one-to-one mapping from the low-level classes.
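
As an illustration of the two levels, the mapping can be held in a plain dictionary. The sketch below contains only the pair mentioned in this comment; the remaining entries would have to come from the model's own mapping, and `SUPERCLASS_OF` and `to_superclasses` are hypothetical names.

```python
from typing import Dict, Iterable, List

# Low-level class -> high-level class.
SUPERCLASS_OF: Dict[str, str] = {
    "Insufficient_Enrollment": "Neutral",  # the only pair stated in this thread
}


def to_superclasses(categories: Iterable[str]) -> List[str]:
    """Map low-level classes to their high-level classes, without duplicates."""
    categories = list(categories)
    unknown = [c for c in categories if c not in SUPERCLASS_OF]
    if unknown:
        # Fail loudly instead of silently dropping an unmapped class.
        raise KeyError(f"No high-level class defined for: {unknown}")
    return sorted({SUPERCLASS_OF[c] for c in categories})
```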

LesyaR commented 2 years ago

I committed some changes to the common_classes.py and predict.py scripts. The common_files now has the mapping-to-superclass function, which is called when outputting the data. The hardcoded path names are now parametrised, so the script should be callable as:

python3 predict.py input_file output_stopped_file output_nonstopped_file

The commit comment contains more details on the changes.
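
For reference, three positional arguments like these could be parsed as follows. This is a sketch assuming argparse; the actual argument handling in predict.py is not shown in this thread, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for the three positional paths named in the command above."""
    parser = argparse.ArgumentParser(prog="predict.py")
    parser.add_argument("input_file", help="path to the input file")
    parser.add_argument("output_stopped_file", help="path for the stopped-trials output")
    parser.add_argument("output_nonstopped_file", help="path for the non-stopped-trials output")
    return parser


# Parsing an explicit argument list instead of sys.argv:
args = build_parser().parse_args(["studies.tsv", "stopped.tsv", "nonstopped.tsv"])
```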

ireneisdoomed commented 2 years ago

Data analysis

Applying the stop-reason categorisation to the ChEMBL 22.02 evidence gives the following coverage:

The vast majority (~89%) of the evidence will not contain this information because it has no stop reason. These are mostly 'Completed' studies and therefore do not enter into the analysis.
Among those that do enter the analysis, the model is not able to assign a class to 3213 records.
For those with results, most are assigned to only one class.

This is a breakdown of how the different subclasses are distributed:

The categories of most interest are Negative and Safety_Sideeffects.
Having looked at examples of Possibly_Negative, I don't think these should be relevant for the scoring, as they are in fact related to administrative reasons.

Workflow

This is the workflow I followed to extract these results. On a separate branch I am working on doing the prediction part in the ChEMBL.py module. This is still WIP and I will track the discussion in another ticket.

1. Prepare input data from ChEMBL evidence

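from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

evd = spark.read.json('data/cttv008-20-01-2022.json.gz')

studies = (
    evd
    # Keep only evidence that reports a reason to stop
    .filter(F.col('studyStopReason').isNotNull())
    # One row per URL, so that the ClinicalTrials link can be selected
    .withColumn('urls', F.explode('urls'))
    .filter(F.col('urls.niceName').contains('ClinicalTrials'))
    # The NCT ID is the second-to-last piece when splitting the URL on %22 (URL-encoded quote)
    .withColumn('nct_id', F.element_at(F.split(F.col('urls.url'), '%22'), -2))
    .select('nct_id', F.col('studyStopReason').alias('why_stopped'))
    .distinct()
)

# coalesce(1) yields a single part file inside the data/studies.tsv directory
studies.coalesce(1).write.csv('data/studies.tsv', sep='\t', header=True)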
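
The NCT ID extraction above relies on the ID being wrapped in URL-encoded quotes (%22). A plain-Python equivalent of the split and element_at(-2) step is sketched below. The example URL is an assumption about what the links look like, and `extract_nct_id` is a hypothetical helper.

```python
from typing import Optional


def extract_nct_id(url: str) -> Optional[str]:
    """Return the second-to-last piece of the URL when split on '%22'."""
    parts = url.split("%22")
    # Fall back to None when the URL does not have enough pieces.
    return parts[-2] if len(parts) >= 2 else None


# Assumed URL shape: the ID sits between two encoded quotes.
example = "https://clinicaltrials.gov/search?id=%22NCT05075902%22"
nct_id = extract_nct_id(example)  # 'NCT05075902'
```

A URL without any %22 yields None here rather than a wrong ID.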

2. Apply BERT model

This is done by calling the predict.py module written by Olesya and edited by me. Code and instructions on how to run it can be found here: https://github.com/ireneisdoomed/stopReasons

This script loads the model, instantiates the BERT Tokenizer and classifier and returns the predictions in a TSV file.

3. Build prediction on top of the ChEMBL evidence

Join the output from step 2 onto the latest ChEMBL submission. This is handled by the new module ChEMBL.py.
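
A rough sketch of what this step could look like, not the code in ChEMBL.py. It assumes the predictions were read back into a Spark DataFrame with the columns `why_stopped` and `studyStopReasonCategories`, and that the stop-reason text is the join key. Both are assumptions, as is the function name.

```python
def attach_stop_reason_categories(evidence, predictions):
    """Left-join the predicted categories onto the evidence by stop-reason text.

    Both arguments are expected to be pyspark.sql.DataFrame objects. Only
    DataFrame methods are used, so no PySpark import is needed here.
    """
    lookup = (
        predictions
        # Align the key name with the evidence schema
        .withColumnRenamed("why_stopped", "studyStopReason")
        .select("studyStopReason", "studyStopReasonCategories")
        # One row per reason, so the join cannot multiply evidence rows
        .dropDuplicates(["studyStopReason"])
    )
    # Left join: evidence without a stop reason keeps a null category
    return evidence.join(lookup, on="studyStopReason", how="left")
```

The left join keeps every evidence string, which matches the schema below where the new field is nullable.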

Final evidence schema:

root
 |-- clinicalPhase: long (nullable = true)
 |-- clinicalStatus: string (nullable = true)
 |-- datasourceId: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- diseaseFromSource: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- drugId: string (nullable = true)
 |-- studyStartDate: string (nullable = true)
 |-- studyStopReason: string (nullable = true)
 |-- targetFromSource: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- urls: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- niceName: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- studyStopReasonCategories: array (nullable = true)  <--- New field
 |    |-- element: string (containsNull = true)

Add `studyStopReasonCategory` to the ChEMBL evidence #1878
