d0choa commented 1 year ago

Background

A piece of information missing from the target page is whether the target can be catalogued as a core essential gene. These genes are unlikely to tolerate inhibition and are therefore likely to cause adverse events if modulated. Knowing that a target is essential would generally discourage a drug discovery scientist from developing an inhibition strategy (exceptions aside).

One source of information we can use is the Cancer Dependency Map. In this project and its ancillary projects (Achilles etc.), they measured fitness after the inhibition of individual genes across a number of cell lines. They catalogued a gene as core essential if the majority of cell lines died after inhibition/KO. Although the experiment is performed in cancer cell lines, it represents a good proxy of whether loss of function is tolerated across a diverse set of tissues. More details here

Data

There are different tiers of data, but I think we would be ok with the slimmest dataset CRISPRInferredCommonEssentials.csv containing 1,856 gene symbols (25kb). The file is available to download on this page: https://depmap.org/portal/download/all/ and apparently hosted in AWS.
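A minimal sketch of reading the gene symbols out of that file. It assumes a single-column CSV with a header row and entries shaped like `SYMBOL (EntrezID)`; that layout is an assumption and should be checked against the actual download. `parse_common_essentials` is a hypothetical helper name, and the input below is toy data.

```python
import csv
import io


def parse_common_essentials(csv_text: str) -> set[str]:
    """Return the gene symbols listed in a one-column CSV.

    Assumes a header row followed by entries such as "AAMP (14)";
    everything after the first space is discarded.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    symbols = set()
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank lines
        symbols.add(row[0].strip().split(" ")[0])
    return symbols


# Toy input standing in for CRISPRInferredCommonEssentials.csv
example = "Essentials\nAAMP (14)\nAARS1 (16)\n"
essentials = parse_common_essentials(example)
```

Returning a set makes the later membership test (is this target in the list?) a constant-time lookup.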

Actions

We might need a series of tickets to implement this feature:

a) Include file in PIS [BE]
b) Map gene symbol to Ensembl and include data field in the target step ETL [BE]
c) Expose field in the API [BE]
d) Show the core essential gene flag/chip on the target page [FE]
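Step b) could look roughly like the sketch below. The symbol-to-Ensembl lookup is a hypothetical in-memory dictionary; in the real ETL it would come from the target index, and `map_symbols_to_ensembl` is an illustrative name rather than an existing function.

```python
def map_symbols_to_ensembl(symbols, symbol_to_ensembl):
    """Split symbols into mapped Ensembl IDs and unmapped leftovers.

    Returns (mapped, unmapped), where mapped is {ensembl_id: symbol}
    and unmapped is a sorted list of symbols with no entry in the lookup.
    """
    mapped = {}
    unmapped = []
    for symbol in sorted(symbols):
        ensembl_id = symbol_to_ensembl.get(symbol)
        if ensembl_id is None:
            unmapped.append(symbol)
        else:
            mapped[ensembl_id] = symbol
    return mapped, unmapped


# Hypothetical lookup with two illustrative entries
lookup = {"KRAS": "ENSG00000133703", "TP53": "ENSG00000141510"}
mapped, unmapped = map_symbols_to_ensembl({"KRAS", "NOT_A_GENE"}, lookup)
```

Keeping the unmapped symbols separate makes it easy to report how many entries were lost to outdated or ambiguous gene symbols.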

In the second phase, we will probably add a target prioritisation column as well, so it's available in AOTF.

inessmit commented 1 year ago

The gene essentiality field will be useful for safety analyses so it would be great to have this on the target page.

I was recommended this paper by Paula Weidemueller from Evangelia's group as the most comprehensive benchmark paper: https://doi.org/10.1186/s12864-021-08129-5

Table S1 (https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-021-08129-5/MediaObjects/12864_2021_8129_MOESM1_ESM.xlsx) contains all the datasets (already mapped to ENS), which were created by running several gene-essentiality identification methods on the same Sanger/Broad integrated dataset that David referred to (https://doi.org/10.1038/s41467-021-21898-7).

Of course the datasets are static, but Paula believes updates to Cancer DepMap from the Broad Institute are minor now and the Sanger contribution is not being updated (although the https://depmap.org/portal/achilles/ page says more data is expected over the next 5 years for 2000 cell lines, that could be coming to an end now; see the 2019 preprint https://www.biorxiv.org/content/10.1101/720243v1).

According to the paper, the recommended method for getting core-fitness essential genes is ADaM ("ADaM" tab in Table S1) and the recommended set of common-essential genes (more lenient set) is by FiPer AUC ("FiPer AUC" tab in Table S1).

Note that the paper recommends the more lenient set of common-essential genes especially for target prioritisation:

These two-level of stringency make CoRe suitable for a variety of use-case scenarios. These range from the robust identification of new human core essential genes (where minimising false positive is essential, thus CFGs should be preferred to CEGs), to filtering out potential cytotoxic candidates when focusing on context-specific essential genes while identifying and prioritising new therapeutic targets (where is more important to minimise the false negatives, thus CEGs should be preferred to CFGs).

inessmit commented 1 year ago

It might also be of interest to provide the annotation of reference 'non-essential gene' as a possible indicator of greater safety/positive prioritisation factor.

There is a reference dataset of 927 non-essential genes from https://pubmed.ncbi.nlm.nih.gov/24987113/ available here: http://tko.ccbr.utoronto.ca/Data/reference_essentials_and_nonessentials_sym_hgnc_entrez.xlsx

Table S1 (https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-021-08129-5/MediaObjects/12864_2021_8129_MOESM1_ESM.xlsx) also contains this same set in the tab "curated_BAGEL_nonEssential" (the curation has subtracted some cancer driver genes).

d0choa commented 1 year ago

Balanced recent review on the topic:

https://link.springer.com/article/10.1007/s00335-023-09984-1

inessmit commented 1 year ago

@buniello

I compared the essential gene lists from the CoRe paper (based on an earlier release of DepMap) and the latest DepMap release
I compared the non-essential gene list from CoRe, DepMap and original Hart 2014 paper

Results all based on gene symbol lists

Counts

The CoRe paper's ADaM set (strict set) contains 1075 genes
CoRe's lenient set (FiPer AUC) contains 1987 genes
DepMap Common essentials contains 1855 genes

Overlaps

Almost all (1060/1075) of the ADaM set are contained in the DepMap set
Overlap between FiPer AUC and DepMap set is 1738 genes (out of 1987 and 1855 respectively)
FiPer AUC contains 249 genes not contained in DepMap set
DepMap set contains 117 genes not contained in the FiPer AUC

Based on these numbers, the current DepMap set looks quite similar in size and overlap to the FiPer AUC set, so the current DepMap set suggested by David would be a good option.

Differences could be due to different methods/processing and new DepMap data having been added since the CoRe paper release, and also changes to Gene Symbols.
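For reference, the comparison above boils down to plain set arithmetic on the symbol lists. The sketch below uses a hypothetical helper, `compare_gene_sets`, and also re-derives the two "not contained" counts from the totals and the overlap reported above.

```python
def compare_gene_sets(a, b):
    """Count shared and set-specific members of two symbol lists."""
    a, b = set(a), set(b)
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }


# Consistency check on the counts reported in this comment
fiper_total, depmap_total, shared = 1987, 1855, 1738
only_fiper = fiper_total - shared    # genes in FiPer AUC but not in DepMap
only_depmap = depmap_total - shared  # genes in DepMap but not in FiPer AUC

# Toy lists to show the helper itself
toy = compare_gene_sets(["A", "B", "C"], ["B", "C", "D", "E"])
```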

Non-essential genes

There are several versions of the same set of non-essential genes originally published in Hart et al. 2014 (http://tko.ccbr.utoronto.ca/#) which was created based on "genes that are not expressed in the majority of tissues and cell lines" (https://doi.org/10.15252/msb.20145216).

The original set contains 927 genes
The CoRe paper published a curated set of 921 genes by subtracting cancer driver genes from above set
Current DepMap file contains 781 genes

The DepMap readme says:

The essential and nonessential controls used throughout the analysis are the Hart reference nonessentials and the intersection of the Hart and Blomen essentials. See Hart et al., Mol. Syst. Biol, 2014 and Blomen et al., Science, 2015. Lists of these genes are provided as AchillesCommonEssentialControls.csv and AchillesNonessentialControls.csv.

I'm not sure why the number of genes is lower in the DepMap file. There are some discrepancies in the gene list, which I suspect are due to gene symbol changes (e.g. the original file contains the SEPT14 gene, which is now called SEPTIN14).
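One way to test the symbol-change hypothesis is to normalise outdated symbols before comparing the lists. This is only a sketch: the rename table holds the single rename mentioned above, the other symbols are placeholders, and a real check would need a full table of previous symbols (for example from HGNC).

```python
def normalise_symbols(symbols, renamed):
    """Replace outdated gene symbols using an old -> current mapping."""
    return {renamed.get(symbol, symbol) for symbol in symbols}


# Only rename mentioned in this thread; anything else would be an assumption
renamed = {"SEPT14": "SEPTIN14"}

# Placeholder lists standing in for the Hart 2014 and DepMap files
old_list = {"SEPT14", "GENE_A"}
new_list = {"SEPTIN14", "GENE_A"}

raw_missing = old_list - new_list
fixed_missing = normalise_symbols(old_list, renamed) - new_list
```

If the difference shrinks substantially after normalisation, symbol changes explain the gap; whatever remains points to genuine differences between the lists.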

Sources

CoRe paper (https://doi.org/10.1186/s12864-021-08129-5, Table S1 Excel tabs: ADaM, FiPer AUC and curated_BAGEL_nonEssential - caution with the curated_BAGEL_nonEssential tab which has SEPT14 gene on line 743 which Excel makes into date format... )
DepMap: https://depmap.org/portal/download/all/, files: CRISPRInferredCommonEssentials.csv and AchillesNonessentialControls.csv
Hart 2014: http://tko.ccbr.utoronto.ca/#, https://doi.org/10.15252/msb.20145216

ireneisdoomed commented 1 year ago

This upcoming data has been discussed in the Safety meeting today. @d0choa and I have been inspecting the Cancer Dependency Map to see how we could enrich the essential/non essential assessment further with context specific data. We think the gene perturbations effects page is of interest, so we could try to emulate it in a new widget. See KRAS for example:

Genes whose overall gene effect is below the -1 threshold are considered essential, and a score of 0 means the gene is non-essential. This assessment could be an additional column in the prioritisation page that informs about Safety.
The granularity per tissue will also be displayed. Each point in the diagram represents a different cell line, and its size indicates whether the gene is differentially expressed in that cell line.
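A rough sketch of how the -1 threshold could be applied per lineage. Using the median as the "overall" summary is an assumption, as are the helper name `summarise_by_lineage` and the toy values.

```python
from collections import defaultdict
from statistics import median

ESSENTIAL_THRESHOLD = -1.0  # cut-off described above


def summarise_by_lineage(rows):
    """Summarise (lineage, gene_effect) pairs for a single gene.

    The median is an assumed choice of summary statistic.
    """
    by_lineage = defaultdict(list)
    for lineage, effect in rows:
        by_lineage[lineage].append(effect)
    summary = {}
    for lineage, effects in by_lineage.items():
        mid = median(effects)
        summary[lineage] = {
            "median": mid,
            "below_threshold": mid < ESSENTIAL_THRESHOLD,
        }
    return summary


# Toy gene-effect values for one gene in two lineages
rows = [
    ("Ampulla of Vater", -1.8),
    ("Ampulla of Vater", -0.7),
    ("Ampulla of Vater", -2.4),
    ("Liver", -0.2),
    ("Liver", 0.1),
]
summary = summarise_by_lineage(rows)
```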

With regard to the data, it is all easily accessible in CSV format. Potentially we'd need to process 3 dependencies:

CRISPRGeneEffect.csv. Effect estimates for all models
Model.csv. Cell line metadata to aggregate each cell line into lineages. There are 1826 cell lines and 31 different lineages, which is a feasible number to be represented in the widget.
CRISPRInferredCommonEssentials.csv List of genes identified as dependent across all lines.
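A sketch of how the first two files could be combined into one long table. The layouts are assumptions: `CRISPRGeneEffect.csv` is taken to be wide (one row per model, one column per gene named `SYMBOL (EntrezID)`), and `Model.csv` is taken to have `ModelID`, `CellLineName` and `OncotreeLineage` columns. The inline CSV strings are toy data and `melt_gene_effect` is a hypothetical helper.

```python
import csv
import io


def melt_gene_effect(effect_csv, model_csv):
    """Turn a wide model x gene matrix into long records with model metadata."""
    models = {
        row["ModelID"]: row for row in csv.DictReader(io.StringIO(model_csv))
    }
    reader = csv.reader(io.StringIO(effect_csv))
    header = next(reader)
    # Assumed column naming: "SYMBOL (EntrezID)"; keep only the symbol
    genes = [column.split(" ")[0] for column in header[1:]]
    records = []
    for row in reader:
        model_id, values = row[0], row[1:]
        model = models.get(model_id, {})
        for gene, value in zip(genes, values):
            if value == "":
                continue  # no measurement for this gene in this model
            records.append({
                "depmapId": model_id,
                "targetSymbol": gene,
                "geneEffect": float(value),
                "cellLineName": model.get("CellLineName"),
                "oncotreeLineage": model.get("OncotreeLineage"),
            })
    return records


effect_csv = (
    ",KRAS (3845),CABYR (26256)\n"
    "ACH-000182,-1.80,0.05\n"
    "ACH-000377,-0.75,\n"
)
model_csv = (
    "ModelID,CellLineName,OncotreeLineage\n"
    "ACH-000182,SNU-869,Ampulla of Vater\n"
    "ACH-000377,SNU-478,Ampulla of Vater\n"
)
records = melt_gene_effect(effect_csv, model_csv)
```

At the real data size this would more likely be a Spark or pandas job; the plain-Python version only illustrates the reshaping and the join on the model identifier.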

@inessmit, @buniello: what do you think?

inessmit commented 1 year ago

Great find, that looks really interesting! Seems relevant to highlight that e.g. KRAS is more essential (than the median of pan-essential genes) in Ampulla of Vater and Pancreas cell lines, and also conversely, that it's less essential in the other tissues! Especially e.g. liver and kidney are important for safety so it would be useful to know it's not essential in those tissues.

buniello commented 1 year ago

I agree, really great find. Will look into that a bit more in the context of the widget. Thank you for this!

buniello commented 1 year ago

Now that the work has been scoped, the AIs for the release are:

a) (potential) UBERON mappings - @ireneisdoomed could you please oversee this task? Happy to discuss further
b) Include file in PIS - @mbdebian and myself will plan details for b, c, d this week
c) Map gene symbol to Ensembl and include data field in the target step ETL [BE]
d) Expose field in the API [BE]
e) Show the core essential gene flag/chip on the target page AND build up a similar visualisation to the DepMap portal (as shown in the screenshot above) [FE] - @LucaFumis let's discuss this on Wednesday

ireneisdoomed commented 1 year ago

I've made a first pass at the UBERON mappings. These are my suggestions:

Lineage	Mapping IDs	Mapping Labels
Ovary/Fallopian Tube	UBERON_0000992;UBERON_0003889	ovary;fallopian tube
Myeloid	UBERON_0012429	hematopoietic tissue
Bowel	UBERON_0000160	intestine
Skin	UBERON_0002097	skin of body
Bladder/Urinary Tract	UBERON_0018707;UBERON_0011143;UBERON_0001556	bladder organ;upper urinary tract;lower urinary tract
Lung	UBERON_0002048	lung
Kidney	UBERON_0002113	kidney
Breast	UBERON_0000310	breast
Lymphoid	UBERON_0001744	lymphoid tissue
Pancreas	UBERON_0001264	pancreas
CNS/Brain	UBERON_0001017	central nervous system
Soft Tissue	UBERON_0034929	external soft tissue zone
Bone	UBERON_0002481	bone tissue
Fibroblast	CL_0000057	fibroblast
Esophagus/Stomach	UBERON_0001043;UBERON_0000945	esophagus;stomach
Thyroid	UBERON_0002046	thyroid gland
Peripheral Nervous System	UBERON_0000010	peripheral nervous system
Pleura	UBERON_0000977	pleura
Prostate	UBERON_0002367	prostate gland
Biliary Tract	UBERON_0002394	bile duct
Head and Neck	UBERON_0007811	craniocervical region
Uterus	UBERON_0000995	uterus
Ampulla of Vater	UBERON_0004913	hepatopancreatic ampulla
Liver	UBERON_0002107	liver
Cervix	UBERON_0000002	uterine cervix
Eye	UBERON_0000970	eye
Vulva/Vagina	UBERON_0000997	mammalian vulva
Adrenal Gland	UBERON_0002369	adrenal gland
Testis	UBERON_0000473	testis

I guess that we want to avoid having multiple UBERONs for a single lineage. In this case, I guess we'd have to go for the more general term. It only affects 3 cases:

Lineage	Supermapping ID	Supermapping Label
Ovary/Fallopian Tube	UBERON_0003975	internal female genitalia
Bladder/Urinary Tract	UBERON_0001008	renal system
Esophagus/Stomach	UBERON_0004921	subdivision of digestive tract
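In code, the "one UBERON per lineage" rule could be sketched as below. The dictionaries contain only a few rows copied from the tables above, and `resolve_tissue_id` is a hypothetical helper name.

```python
# Excerpt of the lineage -> UBERON suggestions above (not the full table)
LINEAGE_TO_UBERON = {
    "Lung": ["UBERON_0002048"],
    "Ovary/Fallopian Tube": ["UBERON_0000992", "UBERON_0003889"],
    "Esophagus/Stomach": ["UBERON_0001043", "UBERON_0000945"],
}

# More general terms for lineages that map to several UBERON IDs
SUPERMAPPING = {
    "Ovary/Fallopian Tube": "UBERON_0003975",
    "Bladder/Urinary Tract": "UBERON_0001008",
    "Esophagus/Stomach": "UBERON_0004921",
}


def resolve_tissue_id(lineage):
    """Return a single tissue ID for a lineage, or None if unresolved."""
    if lineage in SUPERMAPPING:
        return SUPERMAPPING[lineage]
    ids = LINEAGE_TO_UBERON.get(lineage, [])
    # Only accept unambiguous one-to-one mappings
    return ids[0] if len(ids) == 1 else None
```

Returning `None` for unknown or ambiguous lineages leaves it to the caller to decide how to label rows without a mapping.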

The working document is here: https://docs.google.com/spreadsheets/d/1djqEyXSol2Yde8LUIQDPQ3krjowCm8xERE2p9zTR8IM/edit?usp=sharing

As a side note, we have a repository for this type of curation https://github.com/opentargets/curation/tree/0d8599924e9b7d43b5d4cd6fead074033dc9c8a1/mappings/biosystem Whenever we decide where this pipeline is going to sit, these mappings could be pulled from there.

buniello commented 1 year ago

Thank you @ireneisdoomed! they look good to me. I will follow up with @mbdebian @DSuveges for the next steps

DSuveges commented 1 year ago

@d0choa How granular do we want to be with this widget? If we want to kind of "replicate" the depmap plot, we need to capture data from 1000 screens for 2000 essential genes, meaning 2M data points. If so, do we want to capture the "colors" and size of the dots as well? They correspond to mutation class and expression levels.

Capturing just the gene effect, one row in the plot would look like this:

+----------+------------+-----------+------------+---------+------------+----------------+-------------------+
|  depmapId|targetSymbol| geneEffect|cellLineName|  modelId|cellLineName| oncotreeLineage|  diseaseFromSource|
+----------+------------+-----------+------------+---------+------------+----------------+-------------------+
|ACH-000182|        KRAS| -1.8043776|     SNU-869|SIDM00159|     SNU-869|Ampulla of Vater|Ampullary Carcinoma|
|ACH-000377|        KRAS| -0.7470034|     SNU-478|SIDM00160|     SNU-478|Ampulla of Vater|Ampullary Carcinoma|
|ACH-001862|        KRAS| -2.4496787|   TGBC52TKB|     null|   TGBC52TKB|Ampulla of Vater|Ampullary Carcinoma|
|ACH-002023|        KRAS|-0.50364685|   TGBC18TKB|     null|   TGBC18TKB|Ampulla of Vater|Ampullary Carcinoma|
+----------+------------+-----------+------------+---------+------------+----------------+-------------------+

Keeping the depmap ID would allow us to link out to depmap. Given that we are using Sanger model IDs, I think we should keep that here as well, but we can drop it.

If we decide to keep the data as granular as possible, what expectations do the BE/FE have for the aggregation? We can leave the table fully exploded, aggregate by target, or aggregate by target + lineage, which would be the closest to the depmap plot (in case we want to go down that path).

DSuveges commented 1 year ago

Proposed schema:

-RECORD 0---------------------------------
 targetSymbol      | AAMP                 
 depmapId          | ACH-000697           
 cellLineName      | A3/KAW               
 diseaseCellLineId | SIDM00495            
 diseaseFromSource | Non-Hodgkin Lymphoma 
 tissueId          | UBERON_0001744       
 tissueName        | lymphoid tissue      
 mutation          | null                 
 geneEffect        | -0.90650654          
 expression        | 6.2922297            
only showing top 1 row

targetSymbol: ETL needs to map these values to Ensembl gene identifiers.
depmapId: This is the identifier of the screen that can be linked to depmap, e.g. ACH-001494
cellLineName: name of the cell line from the model dataset.
diseaseCellLineId: Sanger cell passport identifier.
diseaseFromSource: disease label from the depmap model dataset
tissueId: UBERON identifier of the mapped tissue
tissueName: name of the mapped tissue. Can be "other" where mapping is not available. (Honestly I would drop it, but currently the ETL cannot resolve tissue ids to names, and there are rows where the id is not available.)
mutation: mutation category. Currently it can only be 'damaging' and 'hotspot'
geneEffect: calculated gene effect.
expression: expression level measured.

The above fields capture all the granularity that is needed to replicate the depmap perturbation effect plot, except the conserving and non-conserving mutations, which seem to be a bit more complicated to pull. The structure can be changed: we can consider grouping the data by tissue if that would significantly help the FE.

The dataset contains 2M data points for the 1855 essential genes. The size of the resulting parquet is 21MB. If anyone is interested, the first version is here: gs://ot-team/dsuveges/essentiality_v1.parquet

DSuveges commented 1 year ago

Update:

New boolean flag included: isEssential explaining if the given gene was included in the essential gene list.
All gene x screen pairs are included in the final output. This allows the target engine to differentiate essential genes, non-essential genes, and genes where this information is not available.
Adding all genes means adding 19M data points (~17.5k genes x 1k cells). The size of the resulting dataset is ~250MB.

 targetSymbol      | AAMP                         
 depmapId          | ACH-000018                   
 cellLineName      | T24                          
 diseaseCellLineId | SIDM01184                    
 diseaseFromSource | Bladder Urothelial Carcinoma 
 tissueId          | UBERON_0001008               
 tissueName        | renal system                 
 mutation          | null                         
 geneEffect        | -1.1532661                   
 expression        | 6.7527485                    
 isEssential       | true                         
only showing top 1 row
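The three-way distinction described in the update could be read off the data roughly as follows. This is a sketch: `essentiality_status` is a hypothetical helper, and the flags dictionary stands in for a lookup built from the `isEssential` column.

```python
def essentiality_status(target_symbol, is_essential_by_symbol):
    """Classify a gene from the isEssential flag, if the gene was screened."""
    flag = is_essential_by_symbol.get(target_symbol)
    if flag is None:
        return "not available"  # gene absent from the DepMap-derived dataset
    return "essential" if flag else "non-essential"


# Example flags for two genes shown in this thread
flags = {"AAMP": True, "CABYR": False}
```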

DSuveges commented 1 year ago

This table, as it is, is very long. We can decide to group the data by gene (leaving 1000 objects in the depmapScreens array):

root
 |-- targetSymbol: string (nullable = true)
 |-- isEssential: boolean (nullable = true)
 |-- depmapScreens: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- depmapId: string (nullable = true)
 |    |    |-- cellLineName: string (nullable = true)
 |    |    |-- diseaseCellLineId: string (nullable = true)
 |    |    |-- diseaseFromSource: string (nullable = true)
 |    |    |-- tissueId: string (nullable = true)
 |    |    |-- tissueName: string (nullable = true)
 |    |    |-- mutation: string (nullable = true)
 |    |    |-- geneEffect: float (nullable = true)
 |    |    |-- expression: float (nullable = true)

The data can further be aggregated by grouping the screens by tissue. This would recapitulate the data model on the depmap website:

root
 |-- targetSymbol: string (nullable = true)
 |-- isEssential: boolean (nullable = true)
 |-- depMapEssentiality: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- tissueId: string (nullable = true)
 |    |    |-- tissueName: string (nullable = true)
 |    |    |-- screens: array (nullable = false)
 |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |-- depmapId: string (nullable = true)
 |    |    |    |    |-- cellLineName: string (nullable = true)
 |    |    |    |    |-- diseaseFromSource: string (nullable = true)
 |    |    |    |    |-- diseaseCellLineId: string (nullable = true)
 |    |    |    |    |-- mutation: string (nullable = true)
 |    |    |    |    |-- geneEffect: float (nullable = true)
 |    |    |    |    |-- expression: float (nullable = true)

The latter would look like this:

{
  "targetSymbol": "CABYR",
  "isEssential": false,
  "depMapEssentiality": [
    {
      "tissueId": "UBERON_0004913",
      "tissueName": "hepatopancreatic ampulla",
      "screens": [
        {
          "depmapId": "ACH-002023",
          "cellLineName": "TGBC18TKB",
          "diseaseFromSource": "Ampullary Carcinoma",
          "geneEffect": -0.03937165,
          "expression": 2.2898345
        },
        {
          "depmapId": "ACH-001862",
          "cellLineName": "TGBC52TKB",
          "diseaseFromSource": "Ampullary Carcinoma",
          "geneEffect": 0.06833311,
          "expression": 1.722466
        }
      ]
    },
    {
      "tissueId": "UBERON_0004921",
      "tissueName": "subdivision of digestive tract",
      "screens": [
        {
          "depmapId": "ACH-000855",
          "cellLineName": "KYSE-150",
          "diseaseFromSource": "Esophageal Squamous Cell Carcinoma",
          "diseaseCellLineId": "SIDM01031",
          "geneEffect": 0.045919016,
          "expression": 1.9597702
        },
        {
          "depmapId": "ACH-000144",
          "cellLineName": "RERF-GC-1B",
          "diseaseFromSource": "Esophagogastric Adenocarcinoma",
          "diseaseCellLineId": "SIDM00358",
          "geneEffect": 0.036817603,
          "expression": 4.9968405
        }
      ]
    }
  ]
}
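For illustration, the flat rows can be folded into this nested shape with a two-level grouping. This is a plain-Python sketch rather than the Spark code that produced the dataset; `nest_by_target_and_tissue` is a hypothetical helper, and null fields are dropped to mirror the JSON above, where `mutation` is absent.

```python
from collections import defaultdict

SCREEN_FIELDS = (
    "depmapId", "cellLineName", "diseaseFromSource",
    "diseaseCellLineId", "mutation", "geneEffect", "expression",
)


def nest_by_target_and_tissue(rows):
    """Group flat screen rows by target, then by tissue."""
    grouped = defaultdict(lambda: defaultdict(list))
    flags = {}
    for row in rows:
        tissue_key = (row["tissueId"], row["tissueName"])
        # Keep only screen-level fields that have a value
        screen = {f: row[f] for f in SCREEN_FIELDS if row.get(f) is not None}
        grouped[row["targetSymbol"]][tissue_key].append(screen)
        flags[row["targetSymbol"]] = row["isEssential"]
    return [
        {
            "targetSymbol": symbol,
            "isEssential": flags[symbol],
            "depMapEssentiality": [
                {"tissueId": tid, "tissueName": name, "screens": screens}
                for (tid, name), screens in tissues.items()
            ],
        }
        for symbol, tissues in grouped.items()
    ]


# Two flat rows taken from the example above
base = {
    "targetSymbol": "CABYR", "isEssential": False,
    "tissueId": "UBERON_0004913", "tissueName": "hepatopancreatic ampulla",
    "diseaseFromSource": "Ampullary Carcinoma",
    "diseaseCellLineId": None, "mutation": None,
}
rows = [
    {**base, "depmapId": "ACH-002023", "cellLineName": "TGBC18TKB",
     "geneEffect": -0.03937165, "expression": 2.2898345},
    {**base, "depmapId": "ACH-001862", "cellLineName": "TGBC52TKB",
     "geneEffect": 0.06833311, "expression": 1.722466},
]
nested = nest_by_target_and_tissue(rows)
```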

d0choa commented 1 year ago

I don't think it could get any better than this.

DSuveges commented 1 year ago

This level of nesting bothers me a bit, and it makes the schema a bit hard to expand in case other data feeds get introduced. But for a while it would certainly do the job.

d0choa commented 1 year ago

The flatter option would also do the job FE-wise if you feel more comfortable with it. We will always query one gene at a time and dump all the information for that gene at once.

buniello commented 1 year ago

Will discuss options for aggregation with @LucaFumis. Thanks @DSuveges, it is looking great!

DSuveges commented 1 year ago

@LucaFumis, here's a link to a json object for a single gene, containing measurements from ~1k screens. link

buniello commented 1 year ago

Discussed next steps (PIS, map gene symbol to Ensembl and include data field in the target step ETL) with @mbdebian today

buniello commented 1 year ago

Discussed today in the office: @carcruz is going to look into visualisation library options to assess feasibility. The secondary option would be to build the visualisation manually.

buniello commented 1 year ago

As discussed with @LucaFumis

Suggested heading text for the DM widget: Gene Essentiality assessment obtained through CRISPR loss-of-function screens in a wide range of cancer cell lines. Source: DepMap Portal.

Tooltip Box Schema

"cellLineName"
Disease: "diseaseFromSource"
Gene Effect: "geneEffect"
Expression: "expression"
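As a sketch of the tooltip schema, the text for one point could be assembled from a screen record as below. It is written in Python for illustration only; the widget itself lives in the front end, and `format_tooltip` is a hypothetical name.

```python
def format_tooltip(screen):
    """Build the tooltip text for one screen record."""
    return "\n".join([
        str(screen["cellLineName"]),
        f"Disease: {screen['diseaseFromSource']}",
        f"Gene Effect: {screen['geneEffect']}",
        f"Expression: {screen['expression']}",
    ])


# Screen record copied from the JSON example earlier in the thread
screen = {
    "depmapId": "ACH-002023",
    "cellLineName": "TGBC18TKB",
    "diseaseFromSource": "Ampullary Carcinoma",
    "geneEffect": -0.03937165,
    "expression": 2.2898345,
}
tooltip = format_tooltip(screen)
```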

To add for MVP:

[x] Adding the vertical Red line at -1 in the plot, showing significance cut-off
[ ] Adding Essential Gene tag on Target Page when isEssential is true
[x] Linking out when clicking on cell line/dots: depmapId is the identifier of the screen that can be linked to depmap, e.g. ACH-001494

Nice to have:

[ ] Model size of "dots" on expression value (bigger dots for bigger values)
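One possible way to derive dot sizes from expression is a linear min-max rescaling, sketched below. The size range (4 to 14) and the function name `scale_marker_sizes` are arbitrary assumptions, and whether the chosen plotting library accepts per-point sizes for this plot type is a separate question.

```python
def scale_marker_sizes(expressions, min_size=4.0, max_size=14.0):
    """Linearly rescale expression values to marker sizes."""
    if not expressions:
        return []
    lo, hi = min(expressions), max(expressions)
    if hi == lo:
        # All values identical: use the midpoint size for every dot
        return [(min_size + max_size) / 2 for _ in expressions]
    span = hi - lo
    return [
        min_size + (value - lo) / span * (max_size - min_size)
        for value in expressions
    ]
```

For example, `scale_marker_sizes([0.0, 5.0, 10.0])` gives the smallest dot to 0.0, the largest to 10.0, and a mid-sized dot to 5.0.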

LucaFumis commented 1 year ago

Quick update on the front end side of the ticket. We're using Plotly to create the visualisation. Just pushed the latest changes, so the preview should be up to date. It's still work in progress, so things can change, but we're getting there.

updated tooltips
removed box tooltip (the grey ones)
added vertical line at -1
added link from points to Depmap portal (might need to change implementation; no cursor pointer at the moment)
aligned points and box

To my understanding, for this type of plot (Plotly's "box") it's not possible to size/style individually for each point based on data.

buniello commented 1 year ago

Discussed in FE meeting: different options/positions for the target essentiality chip on the target page. @LucaFumis will implement accordingly

LucaFumis commented 1 year ago

Target essentiality chip with tooltip:

buniello commented 1 year ago

Discussed now with @LucaFumis. There are some little adjustments to make for this widget:

[x] Change its position within the page: move it after "Baseline Expression"
[x] Add x-axis title: Gene Effect
[x] Try saving vertical space by reducing distance between individual plots

buniello commented 1 year ago

In the next release, we may want to explore further how to visualise different gene expression values by displaying different sized dots in our version of the DepMap plot. We may also want to explore how to visualise the different mutation records (hotspot, damaging, non-conserving and other), though this is not critical for our purpose.

This is for Luca and myself to discuss further.

d0choa commented 1 year ago

This work is done. So I'm closing. @buniello feel free to open subsequent tickets for future work. Thanks, everyone!

opentargets / issues

Include target core essentiality from Cancer DepMap #2917
