opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Include target core essentiality from Cancer DepMap #2917

Closed d0choa closed 1 year ago

d0choa commented 1 year ago

Background

A piece of information missing from the target page is whether the target can be catalogued as a core essential gene. These genes are unlikely to tolerate inhibition and are therefore likely to cause adverse events if modulated. Knowing that a target is essential would generally discourage a drug discovery scientist from developing an inhibition strategy (exceptions aside).

One source of information we can use is the Cancer Dependency Map. In this project and its ancillary projects (Achilles etc.), they measured fitness after the inhibition of individual genes across a number of cell lines. They catalogued a gene as core essential if the majority of cell lines died after inhibition/KO. Although the screens were run in cancer cell lines, this experiment is a good proxy for whether loss of function is tolerated across a diverse set of tissues. More details here

Data

There are different tiers of data, but I think we would be ok with the slimmest dataset CRISPRInferredCommonEssentials.csv containing 1,856 gene symbols (25kb). The file is available to download on this page: https://depmap.org/portal/download/all/ and apparently hosted in AWS.

Actions

We might need a series of tickets to implement this feature:

a) Include file in PIS [BE]
b) Map gene symbol to Ensembl and include data field in the target step ETL [BE]
c) Expose field in the API [BE]
d) Show the core essential gene flag/chip on the target page [FE]
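As a rough illustration of steps a) and b), the sketch below parses a one-column list of essential genes and maps each symbol to an Ensembl gene ID. The "SYMBOL (EntrezID)" row layout, the `Essentials` header, the function names and the hard-coded lookup are all assumptions made for the example; the real file layout should be checked, and in the ETL the lookup would come from the target index.

```python
import csv
import io
import re

# Hypothetical excerpt of the DepMap file. The header name and the
# "SYMBOL (EntrezID)" layout are assumptions, not confirmed in this thread.
ESSENTIALS_CSV = """Essentials
AAMP (14)
KRAS (3845)
NOT_A_GENE (0)
"""

# Illustrative symbol -> Ensembl gene ID lookup (placeholder for the target index).
SYMBOL_TO_ENSEMBL = {
    "AAMP": "ENSG00000127837",
    "KRAS": "ENSG00000133703",
}


def parse_symbols(text):
    """Return gene symbols from a one-column CSV, dropping any ' (id)' suffix."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    symbols = []
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank lines
        symbols.append(re.sub(r"\s*\(\d+\)\s*$", "", row[0].strip()))
    return symbols


def map_to_ensembl(symbols, lookup):
    """Split symbols into {ensembl_id: symbol} and a list of unmapped symbols."""
    mapped, unmapped = {}, []
    for symbol in symbols:
        ensembl_id = lookup.get(symbol)
        if ensembl_id is None:
            unmapped.append(symbol)
        else:
            mapped[ensembl_id] = symbol
    return mapped, unmapped


symbols = parse_symbols(ESSENTIALS_CSV)
mapped, unmapped = map_to_ensembl(symbols, SYMBOL_TO_ENSEMBL)
print(mapped)
print(unmapped)
```

Keeping the unmapped symbols in a separate list makes it easy to see how many genes are lost to symbol changes before the flag reaches the target step.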

In the second phase, we will probably add a target prioritisation column as well, so it's available in AOTF.

inessmit commented 1 year ago

The gene essentiality field will be useful for safety analyses so it would be great to have this on the target page.

I was recommended this paper by Paula Weidemueller from Evangelia's group as the most comprehensive benchmark paper: https://doi.org/10.1186/s12864-021-08129-5

Table S1 (https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-021-08129-5/MediaObjects/12864_2021_8129_MOESM1_ESM.xlsx) contains all the datasets (already mapped to ENS), which were created by running several gene-essentiality identification methods on the same Sanger/Broad integrated dataset that David referred to (https://doi.org/10.1038/s41467-021-21898-7).

Of course the datasets are static, but Paula believes updates to Cancer DepMap from the Broad Institute are now minor and the Sanger contribution is no longer being updated. The https://depmap.org/portal/achilles/ page does say more data is expected over the next 5 years for 2000 cell lines, but that could be coming to an end now (the preprint is from 2019: https://www.biorxiv.org/content/10.1101/720243v1).

According to the paper, the recommended method for getting core-fitness essential genes is ADaM ("ADaM" tab in Table S1) and the recommended set of common-essential genes (more lenient set) is by FiPer AUC ("FiPer AUC" tab in Table S1).

Note that the paper recommends the more lenient set of common-essential genes especially for target prioritisation:

These two-level of stringency make CoRe suitable for a variety of use-case scenarios. These range from the robust identification of new human core essential genes (where minimising false positive is essential, thus CFGs should be preferred to CEGs), to filtering out potential cytotoxic candidates when focusing on context-specific essential genes while identifying and prioritising new therapeutic targets (where is more important to minimise the false negatives, thus CEGs should be preferred to CFGs).

inessmit commented 1 year ago

It might also be of interest to provide the annotation of reference 'non-essential gene' as a possible indicator of greater safety/positive prioritisation factor.

There is a reference dataset of 927 non-essential genes from https://pubmed.ncbi.nlm.nih.gov/24987113/ available here: http://tko.ccbr.utoronto.ca/Data/reference_essentials_and_nonessentials_sym_hgnc_entrez.xlsx

Table S1 (https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-021-08129-5/MediaObjects/12864_2021_8129_MOESM1_ESM.xlsx) also contains this same set in the tab "curated_BAGEL_nonEssential" (the curation has subtracted some cancer driver genes).

d0choa commented 1 year ago

Balanced recent review on the topic:

https://link.springer.com/article/10.1007/s00335-023-09984-1

inessmit commented 1 year ago

@buniello

Results all based on gene symbol lists

Counts

Overlaps

Based on these numbers, the current DepMap set looks quite similar in size and overlap to the FiPer AUC set, so the current DepMap set suggested by David would probably be a good option.

Differences could be due to different methods/processing and new DepMap data having been added since the CoRe paper release, and also changes to Gene Symbols.
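A minimal sketch of how counts and pairwise overlaps between symbol lists could be computed. The gene sets below are toy examples, not the real DepMap, FiPer AUC or ADaM lists, and `overlap_report` is a hypothetical helper name.

```python
from itertools import combinations


def overlap_report(gene_sets):
    """Return (name_a, name_b, shared_count, jaccard) for every pair of sets."""
    report = []
    for (name_a, set_a), (name_b, set_b) in combinations(gene_sets.items(), 2):
        shared = len(set_a & set_b)
        union = len(set_a | set_b)
        jaccard = shared / union if union else 0.0
        report.append((name_a, name_b, shared, round(jaccard, 3)))
    return report


# Toy symbol lists for illustration only.
gene_sets = {
    "DepMap": {"AAMP", "RPL3", "POLR2A", "SF3B1"},
    "FiPer AUC": {"AAMP", "RPL3", "POLR2A", "PLK1"},
    "ADaM": {"RPL3", "POLR2A"},
}

counts = {name: len(genes) for name, genes in gene_sets.items()}
report = overlap_report(gene_sets)
print(counts)
for row in report:
    print(row)
```

The Jaccard index is used here only as one convenient summary of similarity; raw intersection sizes may be just as informative when the sets differ a lot in size.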

Non-essential genes

There are several versions of the same set of non-essential genes originally published in Hart et al. 2014 (http://tko.ccbr.utoronto.ca/#) which was created based on "genes that are not expressed in the majority of tissues and cell lines" (https://doi.org/10.15252/msb.20145216).

The DepMap readme says:

The essential and nonessential controls used throughout the analysis are the Hart reference nonessentials and the intersection of the Hart and Blomen essentials. See Hart et al., Mol. Syst. Biol, 2014 and Blomen et al., Science, 2015. Lists of these genes are provided as AchillesCommonEssentialControls.csv and AchillesNonessentialControls.csv.

I'm not sure why the number of genes is lower in the DepMap file. There are some discrepancies in the gene list, which I suspect are due to gene symbol changes (e.g. the original file contains the gene SEPT14, which is now called SEPTIN14).
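One way to test the symbol-change hypothesis is to normalise both lists to current symbols before comparing them. The sketch below assumes a previous-to-current alias table is available; only the SEPT14 to SEPTIN14 entry comes from this thread, `TOY1` is a made-up symbol, and the function names are hypothetical.

```python
# Hypothetical alias table: previous symbol -> current approved symbol.
# Only SEPT14 -> SEPTIN14 is taken from the discussion above.
PREVIOUS_TO_CURRENT = {"SEPT14": "SEPTIN14"}


def normalise(symbols, aliases):
    """Map each symbol to its current name, leaving unknown symbols unchanged."""
    return {aliases.get(s.strip(), s.strip()) for s in symbols}


def compare(original, depmap, aliases):
    """Compare two symbol lists after normalising both to current symbols."""
    a = normalise(original, aliases)
    b = normalise(depmap, aliases)
    return {
        "shared": sorted(a & b),
        "only_original": sorted(a - b),
        "only_depmap": sorted(b - a),
    }


# Toy lists: TOY1 stands in for a gene missing from the DepMap file.
result = compare(["SEPT14", "TOY1"], ["SEPTIN14"], PREVIOUS_TO_CURRENT)
print(result)
```

Anything left in `only_original` after normalisation is a genuine difference between the files rather than a renaming artefact.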

Sources

ireneisdoomed commented 1 year ago

This upcoming data was discussed in the Safety meeting today. @d0choa and I have been inspecting the Cancer Dependency Map to see how we could further enrich the essential/non-essential assessment with context-specific data. We think the gene perturbation effects page is of interest, so we could try to emulate it in a new widget. See KRAS for example: image

With regard to the data, it is all easily accessible in CSV format. Potentially we'd need to process 3 dependencies:

@inessmit, @buniello: what do you think?

inessmit commented 1 year ago

Great find, that looks really interesting! Seems relevant to highlight that e.g. KRAS is more essential (than the median of pan-essential genes) in Ampulla of Vater and Pancreas cell lines, and also conversely, that it's less essential in the other tissues! Especially e.g. liver and kidney are important for safety so it would be useful to know it's not essential in those tissues.
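A small sketch of the comparison described above: the median gene effect per tissue checked against a pan-essential reference. The Ampulla of Vater values are the KRAS gene effects quoted further down this thread; the Liver and Kidney values are invented for illustration. The -1.0 reference is an assumption based on how DepMap gene-effect scores are usually described (scaled so that the median of common-essential genes is -1) and should be verified against the release notes.

```python
from collections import defaultdict
from statistics import median

# Assumed reference value for the median of pan-essential genes (see lead-in).
REFERENCE_MEDIAN = -1.0

# (tissue, geneEffect) pairs for one gene.
SCREENS = [
    ("Ampulla of Vater", -1.8043776),
    ("Ampulla of Vater", -0.7470034),
    ("Ampulla of Vater", -2.4496787),
    ("Ampulla of Vater", -0.50364685),
    ("Liver", -0.3),   # made-up value
    ("Liver", -0.5),   # made-up value
    ("Kidney", -0.4),  # made-up value
]


def median_effect_by_tissue(screens):
    """Return {tissue: median gene effect} for a list of (tissue, effect) pairs."""
    by_tissue = defaultdict(list)
    for tissue, gene_effect in screens:
        by_tissue[tissue].append(gene_effect)
    return {tissue: median(values) for tissue, values in by_tissue.items()}


def more_essential_than_reference(screens, reference=REFERENCE_MEDIAN):
    """Flag tissues whose median effect is more negative than the reference."""
    medians = median_effect_by_tissue(screens)
    return {tissue: value < reference for tissue, value in medians.items()}


medians = median_effect_by_tissue(SCREENS)
flags = more_essential_than_reference(SCREENS)
print(medians)
print(flags)
```

A more negative gene effect means a stronger dependency, so a tissue is flagged when its median falls below the reference.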

buniello commented 1 year ago

I agree, really great find. Will look into that a bit more in the context of the widget. Thank you for this!

buniello commented 1 year ago

Now that the work has been scoped, the AIs for the release are:

a) (Potential) UBERON mappings - @ireneisdoomed could you please oversee this task, happy to discuss further
b) Include file in PIS - @mbdebian and myself will plan details for b, c, d this week
c) Map gene symbol to Ensembl and include data field in the target step ETL [BE]
d) Expose field in the API [BE]
e) Show the core essential gene flag/chip on the target page AND build up a similar visualisation to the DepMap portal (as shown in the screenshot above) [FE] - @LucaFumis let's discuss this on Wednesday

ireneisdoomed commented 1 year ago

I've made a pass at the UBERON mappings. These are my suggestions:

| Lineage | Mapping IDs | Mapping Labels |
|---|---|---|
| Ovary/Fallopian Tube | UBERON_0000992;UBERON_0003889 | ovary;fallopian tube |
| Myeloid | UBERON_0012429 | hematopoietic tissue |
| Bowel | UBERON_0000160 | intestine |
| Skin | UBERON_0002097 | skin of body |
| Bladder/Urinary Tract | UBERON_0018707;UBERON_0011143;UBERON_0001556 | bladder organ;upper urinary tract;lower urinary tract |
| Lung | UBERON_0002048 | lung |
| Kidney | UBERON_0002113 | kidney |
| Breast | UBERON_0000310 | breast |
| Lymphoid | UBERON_0001744 | lymphoid tissue |
| Pancreas | UBERON_0001264 | pancreas |
| CNS/Brain | UBERON_0001017 | central nervous system |
| Soft Tissue | UBERON_0034929 | external soft tissue zone |
| Bone | UBERON_0002481 | bone tissue |
| Fibroblast | CL_0000057 | fibroblast |
| Esophagus/Stomach | UBERON_0001043;UBERON_0000945 | esophagus;stomach |
| Thyroid | UBERON_0002046 | thyroid gland |
| Peripheral Nervous System | UBERON_0000010 | Peripheral Nervous System |
| Pleura | UBERON_0000977 | pleura |
| Prostate | UBERON_0002367 | prostate gland |
| Biliary Tract | UBERON_0002394 | bile duct |
| Head and Neck | UBERON_0007811 | craniocervical region |
| Uterus | UBERON_0000995 | uterus |
| Ampulla of Vater | UBERON_0004913 | hepatopancreatic ampulla |
| Liver | UBERON_0002107 | liver |
| Cervix | UBERON_0000002 | uterine cervix |
| Eye | UBERON_0000970 | eye |
| Vulva/Vagina | UBERON_0000997 | mammalian vulva |
| Adrenal Gland | UBERON_0002369 | adrenal gland |
| Testis | UBERON_0000473 | testis |

I guess that we want to avoid having multiple UBERONs for a single lineage. In this case, I guess we'd have to go for the more general term. It only affects 3 cases:

| Lineage | Supermapping ID | Supermapping Label |
|---|---|---|
| Ovary/Fallopian Tube | UBERON_0003975 | internal female genitalia |
| Bladder/Urinary Tract | UBERON_0001008 | renal system |
| Esophagus/Stomach | UBERON_0004921 | subdivision of digestive tract |
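The "one term per lineage" rule could be applied roughly as follows. This is a sketch under stated assumptions: only a few of the lineages are included, the dictionaries are hand-copied from the mappings above, and `resolve_tissue_id` is a hypothetical helper rather than anything in the pipeline.

```python
# Subset of the lineage -> UBERON mappings proposed above.
LINEAGE_TO_UBERON = {
    "Ovary/Fallopian Tube": ["UBERON_0000992", "UBERON_0003889"],
    "Esophagus/Stomach": ["UBERON_0001043", "UBERON_0000945"],
    "Lung": ["UBERON_0002048"],
    "Liver": ["UBERON_0002107"],
}

# More general terms for the lineages that map to several UBERON IDs.
SUPERMAPPING = {
    "Ovary/Fallopian Tube": "UBERON_0003975",
    "Bladder/Urinary Tract": "UBERON_0001008",
    "Esophagus/Stomach": "UBERON_0004921",
}


def resolve_tissue_id(lineage):
    """Return a single tissue ID for a lineage, or None if it is unmapped."""
    if lineage in SUPERMAPPING:
        return SUPERMAPPING[lineage]
    ids = LINEAGE_TO_UBERON.get(lineage, [])
    if not ids:
        return None
    if len(ids) > 1:
        # Ambiguous lineages need an explicit supermapping entry.
        raise ValueError(f"{lineage} maps to several terms: {ids}")
    return ids[0]


for lineage in ("Lung", "Esophagus/Stomach", "Unknown Lineage"):
    print(lineage, "->", resolve_tissue_id(lineage))
```

Raising on an ambiguous lineage, instead of silently picking the first ID, makes a newly added multi-term lineage fail loudly until someone curates a general term for it.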

The working document is here: https://docs.google.com/spreadsheets/d/1djqEyXSol2Yde8LUIQDPQ3krjowCm8xERE2p9zTR8IM/edit?usp=sharing

As a side note, we have a repository for this type of curation https://github.com/opentargets/curation/tree/0d8599924e9b7d43b5d4cd6fead074033dc9c8a1/mappings/biosystem Whenever we decide where this pipeline is going to sit, these mappings could be pulled from there.

buniello commented 1 year ago

Thank you @ireneisdoomed! they look good to me. I will follow up with @mbdebian @DSuveges for the next steps

DSuveges commented 1 year ago

@d0choa How granular do we want to be with this widget? If we want to kind of "replicate" the DepMap plot, we need to capture data from 1000 screens for 2000 essential genes, meaning 2M data points. If so, do we want to capture the "colors" and size of the dots as well? They correspond to mutation class and expression levels.

Capturing just the gene effect, one row of the plot would look like this:

+----------+------------+-----------+------------+---------+------------+----------------+-------------------+
|  depmapId|targetSymbol| geneEffect|cellLineName|  modelId|cellLineName| oncotreeLineage|  diseaseFromSource|
+----------+------------+-----------+------------+---------+------------+----------------+-------------------+
|ACH-000182|        KRAS| -1.8043776|     SNU-869|SIDM00159|     SNU-869|Ampulla of Vater|Ampullary Carcinoma|
|ACH-000377|        KRAS| -0.7470034|     SNU-478|SIDM00160|     SNU-478|Ampulla of Vater|Ampullary Carcinoma|
|ACH-001862|        KRAS| -2.4496787|   TGBC52TKB|     null|   TGBC52TKB|Ampulla of Vater|Ampullary Carcinoma|
|ACH-002023|        KRAS|-0.50364685|   TGBC18TKB|     null|   TGBC18TKB|Ampulla of Vater|Ampullary Carcinoma|
+----------+------------+-----------+------------+---------+------------+----------------+-------------------+

Keeping the DepMap ID would allow us to link out to DepMap. Since we are using Sanger model IDs, I think we should keep that here as well, but we can drop it.

If we decide to keep the data as granular as possible, what do the BE/FE expect for the aggregation? We can leave the table fully exploded, aggregate by target, or aggregate by target + lineage, which would be the closest to the DepMap plot (in case we want to go down that path).
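The three aggregation options can be sketched in plain Python as below. The first two rows reuse values from the KRAS table above; the third row (`ACH-XXXXXX`, `TOY-LINE`) is made up to show a second lineage. The function names and output shapes are illustrative assumptions, not the pipeline's actual code.

```python
from collections import defaultdict

# Option 1: fully exploded rows, one per (gene, screen).
rows = [
    {"depmapId": "ACH-000182", "targetSymbol": "KRAS", "geneEffect": -1.8043776,
     "cellLineName": "SNU-869", "oncotreeLineage": "Ampulla of Vater"},
    {"depmapId": "ACH-000377", "targetSymbol": "KRAS", "geneEffect": -0.7470034,
     "cellLineName": "SNU-478", "oncotreeLineage": "Ampulla of Vater"},
    # Made-up row to illustrate a second lineage.
    {"depmapId": "ACH-XXXXXX", "targetSymbol": "KRAS", "geneEffect": -0.2,
     "cellLineName": "TOY-LINE", "oncotreeLineage": "Liver"},
]


def group_by_target(rows):
    """Option 2: {targetSymbol: [screen, ...]}."""
    grouped = defaultdict(list)
    for row in rows:
        screen = {k: v for k, v in row.items() if k != "targetSymbol"}
        grouped[row["targetSymbol"]].append(screen)
    return dict(grouped)


def group_by_target_and_lineage(rows):
    """Option 3: {targetSymbol: {oncotreeLineage: [screen, ...]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        screen = {k: v for k, v in row.items()
                  if k not in ("targetSymbol", "oncotreeLineage")}
        grouped[row["targetSymbol"]][row["oncotreeLineage"]].append(screen)
    return {target: dict(lineages) for target, lineages in grouped.items()}


by_target = group_by_target(rows)
nested = group_by_target_and_lineage(rows)
print(len(by_target["KRAS"]))
print(sorted(nested["KRAS"]))
```

Option 3 mirrors one row of the plot per lineage, at the cost of an extra level of nesting for anything that consumes the data.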

DSuveges commented 1 year ago

Proposed schema:

-RECORD 0---------------------------------
 targetSymbol      | AAMP                 
 depmapId          | ACH-000697           
 cellLineName      | A3/KAW               
 diseaseCellLineId | SIDM00495            
 diseaseFromSource | Non-Hodgkin Lymphoma 
 tissueId          | UBERON_0001744       
 tissueName        | lymphoid tissue      
 mutation          | null                 
 geneEffect        | -0.90650654          
 expression        | 6.2922297            
only showing top 1 row

The above fields capture all the granularity needed to replicate the DepMap perturbation effect plot, except the conserving and non-conserving mutations, which seem a bit more complicated to pull. The structure can be changed: we can consider grouping the data by tissue if that would significantly help the FE.

The dataset contains 2M data points for the 1855 essential genes. The size of the resulting parquet is 21MB. If anyone is interested, the first version is here: gs://ot-team/dsuveges/essentiality_v1.parquet

DSuveges commented 1 year ago

Update:

 targetSymbol      | AAMP                         
 depmapId          | ACH-000018                   
 cellLineName      | T24                          
 diseaseCellLineId | SIDM01184                    
 diseaseFromSource | Bladder Urothelial Carcinoma 
 tissueId          | UBERON_0001008               
 tissueName        | renal system                 
 mutation          | null                         
 geneEffect        | -1.1532661                   
 expression        | 6.7527485                    
 isEssential       | true                         
only showing top 1 row

DSuveges commented 1 year ago

This table, as it is, is very long. We can decide to group the data by gene (leaving 1000 objects in the depmapScreens array):

root
 |-- targetSymbol: string (nullable = true)
 |-- isEssential: boolean (nullable = true)
 |-- depmapScreens: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- depmapId: string (nullable = true)
 |    |    |-- cellLineName: string (nullable = true)
 |    |    |-- diseaseCellLineId: string (nullable = true)
 |    |    |-- diseaseFromSource: string (nullable = true)
 |    |    |-- tissueId: string (nullable = true)
 |    |    |-- tissueName: string (nullable = true)
 |    |    |-- mutation: string (nullable = true)
 |    |    |-- geneEffect: float (nullable = true)
 |    |    |-- expression: float (nullable = true)

The data can further be aggregated by grouping the screens by tissue. This would recapitulate the data model on the depmap website:

root
 |-- targetSymbol: string (nullable = true)
 |-- isEssential: boolean (nullable = true)
 |-- depMapEssentiality: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- tissueId: string (nullable = true)
 |    |    |-- tissueName: string (nullable = true)
 |    |    |-- screens: array (nullable = false)
 |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |-- depmapId: string (nullable = true)
 |    |    |    |    |-- cellLineName: string (nullable = true)
 |    |    |    |    |-- diseaseFromSource: string (nullable = true)
 |    |    |    |    |-- diseaseCellLineId: string (nullable = true)
 |    |    |    |    |-- mutation: string (nullable = true)
 |    |    |    |    |-- geneEffect: float (nullable = true)
 |    |    |    |    |-- expression: float (nullable = true)

The latter would look like this:

{
  "targetSymbol": "CABYR",
  "isEssential": false,
  "depMapEssentiality": [
    {
      "tissueId": "UBERON_0004913",
      "tissueName": "hepatopancreatic ampulla",
      "screens": [
        {
          "depmapId": "ACH-002023",
          "cellLineName": "TGBC18TKB",
          "diseaseFromSource": "Ampullary Carcinoma",
          "geneEffect": -0.03937165,
          "expression": 2.2898345
        },
        {
          "depmapId": "ACH-001862",
          "cellLineName": "TGBC52TKB",
          "diseaseFromSource": "Ampullary Carcinoma",
          "geneEffect": 0.06833311,
          "expression": 1.722466
        }
      ]
    },
    {
      "tissueId": "UBERON_0004921",
      "tissueName": "subdivision of digestive tract",
      "screens": [
        {
          "depmapId": "ACH-000855",
          "cellLineName": "KYSE-150",
          "diseaseFromSource": "Esophageal Squamous Cell Carcinoma",
          "diseaseCellLineId": "SIDM01031",
          "geneEffect": 0.045919016,
          "expression": 1.9597702
        },
        {
          "depmapId": "ACH-000144",
          "cellLineName": "RERF-GC-1B",
          "diseaseFromSource": "Esophagogastric Adenocarcinoma",
          "diseaseCellLineId": "SIDM00358",
          "geneEffect": 0.036817603,
          "expression": 4.9968405
        }
      ]
    }
  ]
}

d0choa commented 1 year ago

I don't think it could get any better than this.

DSuveges commented 1 year ago

This level of nesting bothers me a bit, and it makes the schema a bit hard to expand if other data feeds are introduced. But for a while it would certainly do the job.

d0choa commented 1 year ago

The flatter option would also do the job FE-wise if you feel more comfortable with it. We will always query one gene at a time and dump all the information for that gene at once.

buniello commented 1 year ago

Will discuss options for aggregations with @LucaFumis. Thanks @DSuveges, it is looking great!

DSuveges commented 1 year ago

@LucaFumis, here's a link to a json object for a single gene, containing measurements from ~1k screens. link

buniello commented 1 year ago

Discussed next steps (PIS, map gene symbol to Ensembl and include data field in the target step ETL) with @mbdebian today

buniello commented 1 year ago

Discussed today in the office: @carcruz is going to look into visualisation library options to assess feasibility. The secondary option would be to build the visualisation manually.

buniello commented 1 year ago

As discussed with @LucaFumis

Suggested heading text for the DM widget: Gene Essentiality assessment obtained through CRISPR loss-of-function screens in a wide range of cancer cell lines. Source: DepMap Portal.

Tooltip Box Schema

"cellLineName"
Disease: "diseaseFromSource"
Gene Effect: "geneEffect"
Expression: "expression"
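For illustration, the tooltip schema could be rendered from a single screen record like this. The front end is not written in Python, so this is only a sketch of the layout; `format_tooltip`, the three-decimal rounding and the "n/a" fallback are assumptions. The sample record is taken from the JSON example earlier in the thread.

```python
def format_number(value, decimals=3):
    """Format a numeric field, falling back to 'n/a' when it is missing."""
    return "n/a" if value is None else f"{value:.{decimals}f}"


def format_tooltip(screen):
    """Build the tooltip text for one screen, following the schema above."""
    lines = [
        screen.get("cellLineName") or "n/a",
        f"Disease: {screen.get('diseaseFromSource') or 'n/a'}",
        f"Gene Effect: {format_number(screen.get('geneEffect'))}",
        f"Expression: {format_number(screen.get('expression'))}",
    ]
    return "\n".join(lines)


screen = {
    "depmapId": "ACH-000855",
    "cellLineName": "KYSE-150",
    "diseaseFromSource": "Esophageal Squamous Cell Carcinoma",
    "geneEffect": 0.045919016,
    "expression": 1.9597702,
}

tooltip = format_tooltip(screen)
print(tooltip)
```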

To add for MVP:

Nice to have:

LucaFumis commented 1 year ago

Quick update on the front-end side of the ticket. We're using Plotly to create the visualisation. I just pushed the latest changes, so the preview should be up to date. It's still a work in progress, so things can change, but we're getting there.

To my understanding, for this type of plot (Plotly's "box") it's not possible to size/style individually for each point based on data.

buniello commented 1 year ago

Discussed in FE meeting: different options/positions for the target essentiality chip on the target page. @LucaFumis will implement accordingly

LucaFumis commented 1 year ago

Target essentiality chip with tooltip:

Screenshot 2023-06-09 at 11 12 45

buniello commented 1 year ago

Discussed now with @LucaFumis. There are some little adjustments to make for this widget:

buniello commented 1 year ago

In the next release, we may want to explore further how to visualise different gene expression values by displaying different-sized dots in our version of the DepMap plot. We may also want to explore how to visualise the different mutation records (hotspot, damaging, non-conserving and other), though this is not critical for our purpose.

This is for Luca and myself to discuss further.

d0choa commented 1 year ago

This work is done. So I'm closing. @buniello feel free to open subsequent tickets for future work. Thanks, everyone!