Closed d0choa closed 1 year ago
The gene essentiality field will be useful for safety analyses so it would be great to have this on the target page.
I was recommended this paper by Paula Weidemueller from Evangelia's group as the most comprehensive benchmark paper: https://doi.org/10.1186/s12864-021-08129-5
Table S1 (https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-021-08129-5/MediaObjects/12864_2021_8129_MOESM1_ESM.xlsx) contains all the datasets (already mapped to ENS), which were created by running a few gene-essentiality identification methods on the same Sanger/Broad integrated dataset that David referred to (https://doi.org/10.1038/s41467-021-21898-7).
Of course the datasets are static, but Paula believes updates to the Cancer DepMap from the Broad Institute are minor now and the Sanger contribution is not being updated (although this page, https://depmap.org/portal/achilles/, says more data is expected over the next 5 years for 2000 cell lines; that could be coming to an end now, as the preprint is from 2019: https://www.biorxiv.org/content/10.1101/720243v1).
According to the paper, the recommended method for getting core-fitness essential genes is ADaM ("ADaM" tab in Table S1) and the recommended set of common-essential genes (more lenient set) is by FiPer AUC ("FiPer AUC" tab in Table S1).
Note that the paper recommends the more lenient set of common-essential genes especially for target prioritisation:
These two-level of stringency make CoRe suitable for a variety of use-case scenarios. These range from the robust identification of new human core essential genes (where minimising false positive is essential, thus CFGs should be preferred to CEGs), to filtering out potential cytotoxic candidates when focusing on context-specific essential genes while identifying and prioritising new therapeutic targets (where is more important to minimise the false negatives, thus CEGs should be preferred to CFGs).
It might also be of interest to provide the annotation of reference 'non-essential gene' as a possible indicator of greater safety/positive prioritisation factor.
There is a reference dataset of 927 non-essential genes from https://pubmed.ncbi.nlm.nih.gov/24987113/ available here: http://tko.ccbr.utoronto.ca/Data/reference_essentials_and_nonessentials_sym_hgnc_entrez.xlsx
Table S1 (https://static-content.springer.com/esm/art%3A10.1186%2Fs12864-021-08129-5/MediaObjects/12864_2021_8129_MOESM1_ESM.xlsx) also contains this same set in the tab "curated_BAGEL_nonEssential" (the curation has subtracted some cancer driver genes).
Balanced recent review on the topic:
https://link.springer.com/article/10.1007/s00335-023-09984-1
@buniello
Based on these numbers, it looks to me like the current DepMap set is quite similar in size and overlap to the FiPer AUC set, so the current DepMap set, as suggested by David, would possibly be a good option.
Differences could be due to different methods/processing and new DepMap data having been added since the CoRe paper release, and also changes to Gene Symbols.
There are several versions of the same set of non-essential genes originally published in Hart et al. 2014 (http://tko.ccbr.utoronto.ca/#) which was created based on "genes that are not expressed in the majority of tissues and cell lines" (https://doi.org/10.15252/msb.20145216).
The DepMap readme says:
The essential and nonessential controls used throughout the analysis are the Hart reference nonessentials and the intersection of the Hart and Blomen essentials. See Hart et al., Mol. Syst. Biol, 2014 and Blomen et al., Science, 2015. Lists of these genes are provided as AchillesCommonEssentialControls.csv and AchillesNonessentialControls.csv.
I'm not sure why the number of genes is lower in the DepMap file. There are some discrepancies in the gene list; I suspect this is due to gene symbol changes (e.g. the original file contains the SEPT14 gene, which is now called SEPTIN14).
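One way to check whether symbol changes explain the discrepancy is to rename outdated symbols before comparing the two lists. This is a minimal sketch, not the method used for the numbers above: the function name and the placeholder genes are made up, and only the SEPT14 to SEPTIN14 rename comes from this thread.

```python
def compare_gene_lists(original, current, aliases):
    """Compare two symbol lists after renaming outdated symbols."""
    normalised = {aliases.get(symbol, symbol) for symbol in original}
    current = set(current)
    return {
        "shared": sorted(normalised & current),
        "only_in_original": sorted(normalised - current),
        "only_in_current": sorted(current - normalised),
    }


# Only SEPT14 -> SEPTIN14 is taken from the discussion; a real run would
# need a complete previous-symbol table.
ALIASES = {"SEPT14": "SEPTIN14"}

# GENE_A and GENE_B are placeholders, not entries from either file.
result = compare_gene_lists(
    original=["SEPT14", "GENE_A", "GENE_B"],
    current=["SEPTIN14", "GENE_A"],
    aliases=ALIASES,
)
print(result)
```

Anything left in `only_in_original` after the renaming would point to a real difference between the files rather than a nomenclature change.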
Sources
This upcoming data has been discussed in the Safety meeting today. @d0choa and I have been inspecting the Cancer Dependency Map to see how we could enrich the essential/non-essential assessment further with context-specific data. We think the gene perturbation effects page is of interest, so we could try to emulate it in a new widget. See KRAS for example:
With regard to the data, it is all easily accessible in CSV format. Potentially we'd need to process 3 dependencies:
CRISPRGeneEffect.csv. Effect estimates for all models
Model.csv. Cell line metadata to aggregate each cell line into lineages. There are 1826 cell lines and 31 different lineages, which is a feasible number to be represented in the widget.
CRISPRInferredCommonEssentials.csv. List of genes identified as dependent across all lines.
@inessmit, @buniello: what do you think?
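A rough sketch of how the effect matrix and the cell line metadata could be combined into one row per screen and gene. The column layout is an assumption (model ID in the first column of the effect file, gene columns labelled "SYMBOL (EntrezID)", and ModelID, CellLineName and OncotreeLineage columns in the metadata), as is the function name. The KRAS values are the ones shown later in this thread; the AAMP value is invented.

```python
import csv
import io


def melt_gene_effect(effect_csv, model_csv):
    """Turn the wide gene-effect matrix into one row per (screen, gene)."""
    # Assumed metadata columns: ModelID, CellLineName, OncotreeLineage.
    models = {row["ModelID"]: row for row in csv.DictReader(io.StringIO(model_csv))}

    reader = csv.reader(io.StringIO(effect_csv))
    header = next(reader)
    # Assumed header: first column is the model ID, the rest "SYMBOL (EntrezID)".
    symbols = [column.split(" (")[0] for column in header[1:]]

    rows = []
    for record in reader:
        model_id, values = record[0], record[1:]
        model = models.get(model_id, {})
        for symbol, value in zip(symbols, values):
            if value == "":  # no measurement for this gene in this screen
                continue
            rows.append({
                "depmapId": model_id,
                "targetSymbol": symbol,
                "geneEffect": float(value),
                "cellLineName": model.get("CellLineName"),
                "oncotreeLineage": model.get("OncotreeLineage"),
            })
    return rows


# Toy input: the AAMP value is invented for illustration.
EFFECT_CSV = "\n".join([
    ",KRAS (3845),AAMP (14)",
    "ACH-000182,-1.8043776,-0.9",
    "ACH-000377,-0.7470034,",
])
MODEL_CSV = "\n".join([
    "ModelID,CellLineName,OncotreeLineage",
    "ACH-000182,SNU-869,Ampulla of Vater",
    "ACH-000377,SNU-478,Ampulla of Vater",
])

long_rows = melt_gene_effect(EFFECT_CSV, MODEL_CSV)
for row in long_rows:
    print(row)
```

For the full files this would more likely be done in Spark or pandas; the point here is only the reshaping and the join on the model ID.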
Great find, that looks really interesting! Seems relevant to highlight that e.g. KRAS is more essential (than the median of pan-essential genes) in Ampulla of Vater and Pancreas cell lines, and also conversely, that it's less essential in the other tissues! Especially e.g. liver and kidney are important for safety so it would be useful to know it's not essential in those tissues.
I agree, really great find. Will look into that a bit more in the context of the widget. Thank you for this!
Now that the work has been scoped, the AIs for the release are:
a) (potential) UBERON mappings - @ireneisdoomed could you please oversee this task, happy to discuss further
b) Include file in PIS - @mbdebian and myself will plan details for b,c,d this week
c) map gene symbol to Ensembl and include data field in the target step ETL [BE]
d) expose field in the API [BE]
e) show the core essential gene flag/chip on the target page AND build up a similar visualisation to the DepMap portal (as shown in the screenshot above) [FE] - @LucaFumis let's discuss this on Wednesday
I've made a pass to the UBERON mappings. These are my suggestions:
Lineage | Mapping IDs | Mapping Labels |
---|---|---|
Ovary/Fallopian Tube | UBERON_0000992;UBERON_0003889 | ovary;fallopian tube |
Myeloid | UBERON_0012429 | hematopoietic tissue |
Bowel | UBERON_0000160 | intestine |
Skin | UBERON_0002097 | skin of body |
Bladder/Urinary Tract | UBERON_0018707;UBERON_0011143;UBERON_0001556 | bladder organ;upper urinary tract;lower urinary tract |
Lung | UBERON_0002048 | lung |
Kidney | UBERON_0002113 | kidney |
Breast | UBERON_0000310 | breast |
Lymphoid | UBERON_0001744 | lymphoid tissue |
Pancreas | UBERON_0001264 | pancreas |
CNS/Brain | UBERON_0001017 | central nervous system |
Soft Tissue | UBERON_0034929 | external soft tissue zone |
Bone | UBERON_0002481 | bone tissue |
Fibroblast | CL_0000057 | fibroblast |
Esophagus/Stomach | UBERON_0001043;UBERON_0000945 | esophagus;stomach |
Thyroid | UBERON_0002046 | thyroid gland |
Peripheral Nervous System | UBERON_0000010 | Peripheral Nervous System |
Pleura | UBERON_0000977 | pleura |
Prostate | UBERON_0002367 | prostate gland |
Biliary Tract | UBERON_0002394 | bile duct |
Head and Neck | UBERON_0007811 | craniocervical region |
Uterus | UBERON_0000995 | uterus |
Ampulla of Vater | UBERON_0004913 | hepatopancreatic ampulla |
Liver | UBERON_0002107 | liver |
Cervix | UBERON_0000002 | uterine cervix |
Eye | UBERON_0000970 | eye |
Vulva/Vagina | UBERON_0000997 | mammalian vulva |
Adrenal Gland | UBERON_0002369 | adrenal gland |
Testis | UBERON_0000473 | testis |
I guess that we want to avoid having multiple UBERONs for a single lineage. In this case, I guess we'd have to go for the more general term. It only affects 3 cases:
Lineage | Supermapping ID | Supermapping Label |
---|---|---|
Ovary/Fallopian Tube | UBERON_0003975 | internal female genitalia |
Bladder/Urinary Tract | UBERON_0001008 | renal system |
Esophagus/Stomach | UBERON_0004921 | subdivision of digestive tract |
The working document is here: https://docs.google.com/spreadsheets/d/1djqEyXSol2Yde8LUIQDPQ3krjowCm8xERE2p9zTR8IM/edit?usp=sharing
As a side note, we have a repository for this type of curation https://github.com/opentargets/curation/tree/0d8599924e9b7d43b5d4cd6fead074033dc9c8a1/mappings/biosystem Whenever we decide where this pipeline is going to sit, these mappings could be pulled from there.
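To illustrate how the mapping could be applied once it is pulled from the curation repository, here is a small sketch with a handful of entries copied from the tables above, using the general term for the three multi-term lineages. The dictionary and function names are hypothetical, and the "other" fallback for unmapped lineages is an assumption.

```python
# A few entries copied from the tables above; the full mapping would be
# loaded from the curated file rather than hard-coded.
LINEAGE_TO_UBERON = {
    "Lung": ("UBERON_0002048", "lung"),
    "Pancreas": ("UBERON_0001264", "pancreas"),
    "Ampulla of Vater": ("UBERON_0004913", "hepatopancreatic ampulla"),
    # General terms replacing multiple UBERON IDs:
    "Ovary/Fallopian Tube": ("UBERON_0003975", "internal female genitalia"),
    "Bladder/Urinary Tract": ("UBERON_0001008", "renal system"),
    "Esophagus/Stomach": ("UBERON_0004921", "subdivision of digestive tract"),
}


def annotate_tissue(row, mapping=LINEAGE_TO_UBERON):
    """Return a copy of the row with tissueId and tissueName added."""
    # Assumption: unmapped lineages get no ID and the label "other".
    tissue_id, tissue_name = mapping.get(row["oncotreeLineage"], (None, "other"))
    return {**row, "tissueId": tissue_id, "tissueName": tissue_name}


mapped = annotate_tissue({"cellLineName": "KYSE-150", "oncotreeLineage": "Esophagus/Stomach"})
unmapped = annotate_tissue({"cellLineName": "example", "oncotreeLineage": "Not In Table"})
print(mapped)
print(unmapped)
```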
Thank you @ireneisdoomed! they look good to me. I will follow up with @mbdebian @DSuveges for the next steps
@d0choa How granular do we want to be with this widget? If we want to kind of "replicate" the depmap plot, we need to capture data from 1000 screens for 2000 essential genes, meaning 2M datapoints. If so, do we want to capture the "colors" and size of the dots as well? They correspond to mutation class and expression levels.
Just capturing the gene effect one row in the plot would look like this:
+----------+------------+-----------+------------+---------+------------+----------------+-------------------+
| depmapId|targetSymbol| geneEffect|cellLineName| modelId|cellLineName| oncotreeLineage| diseaseFromSource|
+----------+------------+-----------+------------+---------+------------+----------------+-------------------+
|ACH-000182| KRAS| -1.8043776| SNU-869|SIDM00159| SNU-869|Ampulla of Vater|Ampullary Carcinoma|
|ACH-000377| KRAS| -0.7470034| SNU-478|SIDM00160| SNU-478|Ampulla of Vater|Ampullary Carcinoma|
|ACH-001862| KRAS| -2.4496787| TGBC52TKB| null| TGBC52TKB|Ampulla of Vater|Ampullary Carcinoma|
|ACH-002023| KRAS|-0.50364685| TGBC18TKB| null| TGBC18TKB|Ampulla of Vater|Ampullary Carcinoma|
+----------+------------+-----------+------------+---------+------------+----------------+-------------------+
Keeping the depmap ID would allow us to link out to depmap. Given that we are using Sanger model IDs, I think we should keep that here as well, but we can drop it.
If we decide to keep the data as granular as possible, what expectations do the BE/FE have for the aggregation? We can leave the table fully exploded, aggregated by target, or aggregated by target + lineage, which would be the closest to the depmap plot (in case we want to go down that path).
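To make the aggregation options concrete, this sketch groups the four KRAS rows shown above either by target or by target + lineage and takes the median gene effect per group. The median is only one possible summary and the function name is hypothetical; nothing here reflects what the BE or FE would actually need.

```python
from collections import defaultdict
from statistics import median


def median_gene_effect(rows, keys):
    """Median geneEffect per group, where a group is the tuple of key values."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[tuple(row[key] for key in keys)].append(row["geneEffect"])
    return {group: median(values) for group, values in grouped.items()}


# The four KRAS rows from the table above.
KRAS_ROWS = [
    {"targetSymbol": "KRAS", "oncotreeLineage": "Ampulla of Vater", "geneEffect": -1.8043776},
    {"targetSymbol": "KRAS", "oncotreeLineage": "Ampulla of Vater", "geneEffect": -0.7470034},
    {"targetSymbol": "KRAS", "oncotreeLineage": "Ampulla of Vater", "geneEffect": -2.4496787},
    {"targetSymbol": "KRAS", "oncotreeLineage": "Ampulla of Vater", "geneEffect": -0.50364685},
]

by_target = median_gene_effect(KRAS_ROWS, keys=("targetSymbol",))
by_target_lineage = median_gene_effect(KRAS_ROWS, keys=("targetSymbol", "oncotreeLineage"))
print(by_target)
print(by_target_lineage)
```

With only one lineage in the sample both groupings give the same value; with the full table the target + lineage grouping would yield one entry per lineage, which is the shape the depmap plot uses.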
Proposed schema:
-RECORD 0---------------------------------
targetSymbol | AAMP
depmapId | ACH-000697
cellLineName | A3/KAW
diseaseCellLineId | SIDM00495
diseaseFromSource | Non-Hodgkin Lymphoma
tissueId | UBERON_0001744
tissueName | lymphoid tissue
mutation | null
geneEffect | -0.90650654
expression | 6.2922297
only showing top 1 row
other where mapping is not available. (Honestly I would drop it, but currently the ETL cannot resolve tissue IDs to names, and there are rows where the ID is not available.)
'damaging' and 'hotspot'
The above fields capture all the granularity needed to replicate the depmap perturbation effect plot, except the conserving and non-conserving mutations, which seem to be a bit more complicated to pull. The structure can be changed: we can consider grouping the data by tissue if that would significantly help the FE.
The dataset contains 2M datapoints for the 1855 essential genes. The size of the resulting parquet is 21MB. If anyone is interested, the first version is here: gs://ot-team/dsuveges/essentiality_v1.parquet
Update:
isEssential: explaining if the given gene was included in the essential gene list.
targetSymbol | AAMP
depmapId | ACH-000018
cellLineName | T24
diseaseCellLineId | SIDM01184
diseaseFromSource | Bladder Urothelial Carcinoma
tissueId | UBERON_0001008
tissueName | renal system
mutation | null
geneEffect | -1.1532661
expression | 6.7527485
isEssential | true
only showing top 1 row
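A sketch of how the isEssential flag could be derived by checking each target against the common essentials list. It assumes the list entries are labelled "SYMBOL (EntrezID)", which should be verified against the downloaded file; the function names are hypothetical. AAMP and CABYR are used because their flags appear in this thread.

```python
def parse_symbol(label):
    """'AAMP (14)' -> 'AAMP'; plain symbols pass through unchanged."""
    return label.split(" (")[0].strip()


def flag_essential(rows, essential_labels):
    """Return copies of the rows with a boolean isEssential field added."""
    essential = {parse_symbol(label) for label in essential_labels}
    return [{**row, "isEssential": row["targetSymbol"] in essential} for row in rows]


# Assumed label format for entries in the essentials list.
ESSENTIAL_LABELS = ["AAMP (14)"]

flagged = flag_essential(
    [
        {"targetSymbol": "AAMP", "depmapId": "ACH-000018", "geneEffect": -1.1532661},
        {"targetSymbol": "CABYR", "depmapId": "ACH-002023", "geneEffect": -0.03937165},
    ],
    ESSENTIAL_LABELS,
)
for row in flagged:
    print(row["targetSymbol"], row["isEssential"])
```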
This table, as it is, is very long. We can decide to group the data by gene (leaving 1000 objects in the depmapScreens array):
root
|-- targetSymbol: string (nullable = true)
|-- isEssential: boolean (nullable = true)
|-- depmapScreens: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- depmapId: string (nullable = true)
| | |-- cellLineName: string (nullable = true)
| | |-- diseaseCellLineId: string (nullable = true)
| | |-- diseaseFromSource: string (nullable = true)
| | |-- tissueId: string (nullable = true)
| | |-- tissueName: string (nullable = true)
| | |-- mutation: string (nullable = true)
| | |-- geneEffect: float (nullable = true)
| | |-- expression: float (nullable = true)
The data can further be aggregated by grouping the screens by tissue. This would recapitulate the data model on the depmap website:
root
|-- targetSymbol: string (nullable = true)
|-- isEssential: boolean (nullable = true)
|-- depMapEssentiality: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- tissueId: string (nullable = true)
| | |-- tissueName: string (nullable = true)
| | |-- screens: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- depmapId: string (nullable = true)
| | | | |-- cellLineName: string (nullable = true)
| | | | |-- diseaseFromSource: string (nullable = true)
| | | | |-- diseaseCellLineId: string (nullable = true)
| | | | |-- mutation: string (nullable = true)
| | | | |-- geneEffect: float (nullable = true)
| | | | |-- expression: float (nullable = true)
This latter would look like this:
{
"targetSymbol": "CABYR",
"isEssential": false,
"depMapEssentiality": [
{
"tissueId": "UBERON_0004913",
"tissueName": "hepatopancreatic ampulla",
"screens": [
{
"depmapId": "ACH-002023",
"cellLineName": "TGBC18TKB",
"diseaseFromSource": "Ampullary Carcinoma",
"geneEffect": -0.03937165,
"expression": 2.2898345
},
{
"depmapId": "ACH-001862",
"cellLineName": "TGBC52TKB",
"diseaseFromSource": "Ampullary Carcinoma",
"geneEffect": 0.06833311,
"expression": 1.722466
}
]
},
{
"tissueId": "UBERON_0004921",
"tissueName": "subdivision of digestive tract",
"screens": [
{
"depmapId": "ACH-000855",
"cellLineName": "KYSE-150",
"diseaseFromSource": "Esophageal Squamous Cell Carcinoma",
"diseaseCellLineId": "SIDM01031",
"geneEffect": 0.045919016,
"expression": 1.9597702
},
{
"depmapId": "ACH-000144",
"cellLineName": "RERF-GC-1B",
"diseaseFromSource": "Esophagogastric Adenocarcinoma",
"diseaseCellLineId": "SIDM00358",
"geneEffect": 0.036817603,
"expression": 4.9968405
}
]
}
]
}
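For reference, the nesting by tissue can be sketched in plain Python starting from flat rows. This is illustrative only (the actual dataset is produced in Spark, and the function name is made up); null fields are dropped from each screen, which is why diseaseCellLineId is absent from some screens in the example above.

```python
import json


def nest_by_tissue(target_symbol, is_essential, rows):
    """Group flat screen rows into the tissue -> screens structure."""
    tissues = {}
    for row in rows:
        key = (row["tissueId"], row["tissueName"])
        # Keep screen-level fields only, and drop nulls.
        screen = {
            field: value
            for field, value in row.items()
            if field not in ("tissueId", "tissueName") and value is not None
        }
        tissues.setdefault(key, []).append(screen)
    return {
        "targetSymbol": target_symbol,
        "isEssential": is_essential,
        "depMapEssentiality": [
            {"tissueId": tissue_id, "tissueName": name, "screens": screens}
            for (tissue_id, name), screens in tissues.items()
        ],
    }


# Three of the CABYR screens from the example above, flattened.
FLAT_ROWS = [
    {"tissueId": "UBERON_0004913", "tissueName": "hepatopancreatic ampulla",
     "depmapId": "ACH-002023", "cellLineName": "TGBC18TKB",
     "diseaseFromSource": "Ampullary Carcinoma", "diseaseCellLineId": None,
     "geneEffect": -0.03937165, "expression": 2.2898345},
    {"tissueId": "UBERON_0004913", "tissueName": "hepatopancreatic ampulla",
     "depmapId": "ACH-001862", "cellLineName": "TGBC52TKB",
     "diseaseFromSource": "Ampullary Carcinoma", "diseaseCellLineId": None,
     "geneEffect": 0.06833311, "expression": 1.722466},
    {"tissueId": "UBERON_0004921", "tissueName": "subdivision of digestive tract",
     "depmapId": "ACH-000855", "cellLineName": "KYSE-150",
     "diseaseFromSource": "Esophageal Squamous Cell Carcinoma",
     "diseaseCellLineId": "SIDM01031",
     "geneEffect": 0.045919016, "expression": 1.9597702},
]

nested = nest_by_tissue("CABYR", False, FLAT_ROWS)
print(json.dumps(nested, indent=2))
```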
I don't think it could get any better than this.
This level of nesting bothers me a bit, and makes the schema a bit hard to expand in case other data feeds get introduced. But for a while it would certainly do the job.
The flatter option would also do the job FE-wise if you feel more comfortable with it. We will always query one gene at a time and dump all the information for that gene at once.
will discuss options for aggregations with @LucaFumis. Thanks @DSuveges, it is looking great!
@LucaFumis, here's a link to a json object for a single gene, containing measurements from ~1k screens. link
Discussed next steps (PIS, map gene symbol to Ensembl and include data field in the target step ETL) with @mbdebian today
Discussed today in the office: @carcruz is going to look into visualisation library options to assess feasibility. Secondary option would be to build the visualisation up manually.
As discussed with @LucaFumis
Suggested heading text for the DM widget: Gene Essentiality assessment obtained through CRISPR loss-of-function screens in a wide range of cancer cell lines. Source: DepMap Portal.
Tooltip Box Schema
"cellLineName"
Disease: "diseaseFromSource"
Gene Effect: "geneEffect"
Expression: "expression"
To add for MVP:
-1 in the plot, showing significance cut-off
Essential Gene tag on Target Page when isEssential is true
depmapId: This is the identifier of the screen that can be linked to depmap, e.g. ACH-001494
Nice to have:
expression value (bigger dots for bigger values)
Quick update on the front end side of the ticket. We're using Plotly to create the visualisation. Just pushed latest changes so the preview should be up to date. It's still work in progress, so things can change, but getting there.
To my understanding, for this type of plot (Plotly's "box") it's not possible to size/style individually for each point based on data.
Discussed in FE meeting: different options/positions for the target essentiality chip on the target page. @LucaFumis will implement accordingly
Target essentiality chip with tooltip:
Discussed now with @LucaFumis. There are some little adjustments to make for this widget:
Gene Effect
In the next release, we may want to explore further how to visualise different gene expression values by displaying different sized dots in our version of the DepMap plot.
We may also want to explore how to visualise the different mutations records (hotspot, damaging, non-conserving and other), though this is not critical for our purpose.
This is for Luca and myself to discuss further.
This work is done. So I'm closing. @buniello feel free to open subsequent tickets for future work. Thanks, everyone!
Background
A piece of information missing from the target page is whether the target can be catalogued as a core essential gene. These genes are unlikely to tolerate inhibition and are therefore likely to cause adverse events if modulated. Knowing that a target is essential would generally discourage a drug discovery scientist from developing an inhibition strategy (exceptions aside).
One source of information we can use is the Cancer Dependency Map. In this project and its ancillary projects (Achilles etc.), they measured fitness after the inhibition of individual genes across a number of cell lines. They catalogued a gene as core essential if the majority of cell lines died after inhibition/KO. Although performed in cancer cell lines, this experiment represents a good proxy of whether loss of function is tolerated across a diverse set of tissues. More details here
Data
There are different tiers of data, but I think we would be ok with the slimmest dataset, CRISPRInferredCommonEssentials.csv, containing 1,856 gene symbols (25kb). The file is available to download on this page: https://depmap.org/portal/download/all/ and apparently hosted in AWS.
Actions
We might need a series of tickets to implement this feature:
a) Include file in PIS [BE]
b) map gene symbol to Ensembl and include data field in the target step ETL [BE]
c) expose field in the API [BE]
d) show the core essential gene flag/chip on the target page. [FE]
In the second phase, we will probably add a target prioritisation column as well, so it's available in AOTF.