opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Add `studyStopReasonCategory` to the ChEMBL evidence #1878

Closed ireneisdoomed closed 2 years ago

ireneisdoomed commented 2 years ago

As a developer I want to categorise the reasons why a Clinical Trial has stopped in order to analyse ChEMBL evidence in a new dimension and score the evidence accordingly.

Background

This ticket belongs to the epic where we want to integrate @LesyaR's NLP model into the Platform's pipelines.

The NLP model can be downloaded here: gs://ot-team/olesya/bert_trials

Olesya has put together some scripts to run predictions:

Tasks

The idea is to add a new column called studyStopReasonCategory that holds the result of applying the model to studyStopReason. We want to achieve this in a computationally efficient way.

Acceptance tests

How do we know the task is complete?

  1. When I check the CT NCT05075902, reason to stop is "COVID19" and the assigned category is Covid19.
  2. When I check the CT NCT04958967, reason to stop is "The protocol need to review" and the assigned category is Study_Design.
  3. When I check the CT NCT05067322, reason to stop is "Low enrollment at the site." and the assigned category is Insufficient_Enrollment.
  4. When I check the CT NCT05048511, reason to stop is "The project was abandoned because of a lot of publications on the subject in the meantime and it was not considered relevant to continue." and the assigned category is Another_Study.
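
The four acceptance tests could be written as a small table-driven check. This is only a sketch: the function failed_trials and the shape of its input (a dict from NCT id to a single assigned category) are assumptions for illustration, not part of the pipeline.

```python
# Expected (stop reason, category) pairs, copied from the acceptance tests.
EXPECTED = {
    "NCT05075902": ("COVID19", "Covid19"),
    "NCT04958967": ("The protocol need to review", "Study_Design"),
    "NCT05067322": ("Low enrollment at the site.", "Insufficient_Enrollment"),
    "NCT05048511": (
        "The project was abandoned because of a lot of publications on the "
        "subject in the meantime and it was not considered relevant to continue.",
        "Another_Study",
    ),
}


def failed_trials(predicted):
    """Return the NCT ids whose predicted category differs from the expected one.

    `predicted` is a hypothetical dict mapping NCT id -> assigned category.
    Trials missing from `predicted` are reported as failures.
    """
    return [
        nct_id
        for nct_id, (_reason, category) in EXPECTED.items()
        if predicted.get(nct_id) != category
    ]
```

An empty list from failed_trials means all four acceptance tests pass.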

ireneisdoomed commented 2 years ago

At the moment I am stuck at task #1 because I cannot reproduce the environment needed to load the model. This is the error I am getting: ModuleNotFoundError: No module named 'transformers.modeling_bert'
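
A possible cause, assuming a transformers 4.x release is installed and the model was saved as a pickled object: version 4 moved the BERT code from transformers.modeling_bert to transformers.models.bert.modeling_bert, so a model saved against the old layout can no longer resolve that module. Installing a 3.x release is one way around it. Another is the minimal sketch below, which registers the new module under the old name before the model is loaded. The helper alias_module is hypothetical and not part of the prediction scripts.

```python
import importlib
import sys


def alias_module(old_name, new_name):
    """Make the module importable as `new_name` also resolvable as `old_name`."""
    sys.modules[old_name] = importlib.import_module(new_name)


try:
    # Assumption: transformers >= 4 is installed, where the BERT code lives
    # under transformers.models.bert.
    alias_module(
        "transformers.modeling_bert",
        "transformers.models.bert.modeling_bert",
    )
except ImportError:
    # transformers (or one of its dependencies) is missing, or the installed
    # release predates the move, so there is nothing to alias.
    pass
```

The alias has to be registered before the model file is loaded, because the old module name is resolved at load time.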

d0choa commented 2 years ago

Keep in mind that there are two levels of categories. For example, the classified stop reason Insufficient enrollment is also catalogued as a Neutral reason, because the problem is considered to be independent of the study design.

The high-level classes are the result of a one-to-one mapping from the low-level classes.
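
As a rough illustration of the two levels, the mapping can be held in a lookup table. Only the pair mentioned above is filled in; the label spelling "Neutral" and the names SUPERCLASS_BY_CLASS and to_superclass are assumptions, and the real mapping lives in the prediction scripts.

```python
# Low-level class -> high-level class. Only the pair named in this thread is
# listed; the full mapping would have one entry per low-level class.
SUPERCLASS_BY_CLASS = {
    "Insufficient_Enrollment": "Neutral",
}


def to_superclass(low_level_class):
    """Return the high-level class, or None if the class is not in the table."""
    return SUPERCLASS_BY_CLASS.get(low_level_class)
```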

LesyaR commented 2 years ago

I committed some changes to the common_classes.py and predict.py scripts. common_classes.py now has the function that maps each class to its superclass, which is called when outputting the data. The hardcoded path names are now parametrised, so the script should be callable as:

python3 predict.py input_file output_stopped_file output_nonstopped_file

The commit comment contains more details on the changes.
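
For readers who have not opened the script, three positional arguments like these could be declared with argparse as sketched below. This is an assumption about the interface only; the actual predict.py may parse its arguments differently, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser for the three positional paths named in the command."""
    parser = argparse.ArgumentParser(prog="predict.py")
    parser.add_argument("input_file")
    parser.add_argument("output_stopped_file")
    parser.add_argument("output_nonstopped_file")
    return parser
```

Calling build_parser().parse_args() inside the script would then expose the three paths as attributes with the same names.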

ireneisdoomed commented 2 years ago

Data analysis

The enrichment of the stop reason categories across the ChEMBL 22.02 evidence is as follows:

image.png

This is a breakdown of how the different subclasses are distributed:

image.png

Workflow

This is the workflow I followed to obtain these results. On a separate branch I am working on moving the prediction step into the ChEMBL.py module. This is still WIP, and I will track the discussion in another ticket.

1. Prepare input data from ChEMBL evidence

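from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

evd = spark.read.json('data/cttv008-20-01-2022.json.gz')

studies = (
    evd
    # Keep only the evidence that has a reason to stop
    .filter(F.col('studyStopReason').isNotNull())
    # One row per URL, keeping only the ClinicalTrials links
    .withColumn('urls', F.explode('urls'))
    .filter(F.col('urls.niceName').contains('ClinicalTrials'))
    # Take the second-to-last piece of the URL split on %22 (an encoded double quote)
    .withColumn('nct_id', F.element_at(F.split(F.col('urls.url'), '%22'), -2))
    .select('nct_id', F.col('studyStopReason').alias('why_stopped'))
    .distinct()
)

# Spark writes a directory named data/studies.tsv that holds a single part file
studies.coalesce(1).write.csv('data/studies.tsv', sep='\t', header=True)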
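
To make the nct_id extraction easier to follow, here is the same split logic in plain Python. The example URL is hypothetical: it assumes the link wraps the trial identifier in URL-encoded double quotes (%22) at the end of the address, which is what taking the second-to-last piece implies.

```python
def nct_id_from_url(url):
    """Return the second-to-last piece of `url` after splitting on '%22'."""
    return url.split("%22")[-2]


# Hypothetical URL shape, used only to illustrate the split.
example_url = "https://clinicaltrials.gov/search?id=%22NCT05075902%22"
print(nct_id_from_url(example_url))  # NCT05075902
```

The split yields the address prefix, the identifier and an empty trailing string, so index -2 selects the identifier.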

2. Apply BERT model

This is done by calling the predict.py module written by Olesya and edited by me. Code and instructions on how to run it can be found here: https://github.com/ireneisdoomed/stopReasons

This script loads the model, instantiates the BERT Tokenizer and classifier and returns the predictions in a TSV file.

3. Build prediction on top of the ChEMBL evidence

Join the output from step 2 onto the latest ChEMBL submission. This is handled by the new module ChEMBL.py.
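
A row-level sketch of what this step amounts to, in plain Python rather than Spark. It assumes the predictions are keyed on the stop reason text, which may not be the key ChEMBL.py uses; add_categories and the example values are hypothetical.

```python
def add_categories(evidence_rows, categories_by_reason):
    """Left-join predicted categories onto evidence rows by stop reason text.

    Rows without a stop reason, or with a reason that has no prediction,
    get None in the new field.
    """
    return [
        {
            **row,
            "studyStopReasonCategories": categories_by_reason.get(
                row.get("studyStopReason")
            ),
        }
        for row in evidence_rows
    ]


# Hypothetical inputs: one stopped trial and one row without a stop reason.
evidence = [
    {"drugId": "EXAMPLE_DRUG_1", "studyStopReason": "Low enrollment at the site."},
    {"drugId": "EXAMPLE_DRUG_2"},
]
predictions = {"Low enrollment at the site.": ["Insufficient_Enrollment"]}
annotated = add_categories(evidence, predictions)
```

Keeping every evidence row, including those without a prediction, matches the nullable new field in the final schema.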

Final evidence schema:

root
 |-- clinicalPhase: long (nullable = true)
 |-- clinicalStatus: string (nullable = true)
 |-- datasourceId: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- diseaseFromSource: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- drugId: string (nullable = true)
 |-- studyStartDate: string (nullable = true)
 |-- studyStopReason: string (nullable = true)
 |-- targetFromSource: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- urls: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- niceName: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- studyStopReasonCategories: array (nullable = true)  <--- New field
 |    |-- element: string (containsNull = true)