neurogenomics / MSTExplorer

Multi-Scale Target Explorer systematically identifies, prioritises, and visualises cell-type-specific gene therapy targets across the phenome.
https://neurogenomics.github.io/MSTExplorer/

Fix CTD #14

Closed bschilder closed 1 year ago

bschilder commented 1 year ago

@NathanSkene @KittyMurphy I just noticed that the CTD originally included within MultiEWCE is incorrect. It was not properly processed to the gene level.

MultiEWCE::load_example_ctd()
Screenshot 2023-03-26 at 10 49 38

Instead, it appears to be at the level of transcripts, which were then randomly assigned unique gene-level IDs. You can also tell this right away when looking at the number of dimensions in the CTD.
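As a sketch of how one could spot this (using a toy matrix standing in for e.g. `ctd[[1]]$mean_exp`): strip the ".#" suffix that R appends to duplicated row names and count how many "genes" collapse into one.

```r
# Toy stand-in for a CTD expression matrix with suffixed row names:
mat <- matrix(0, nrow = 6, ncol = 2,
              dimnames = list(c("ENSG01", "ENSG01.1", "ENSG01.2",
                                "ENSG02", "ENSG02.1", "ENSG03"),
                              c("celltype_a", "celltype_b")))
ids <- rownames(mat)
base_ids <- sub("\\.\\d+$", "", ids)   # strip the ".1"/".2" suffix
n_total  <- length(ids)                # apparent number of "genes"
n_genes  <- length(unique(base_ids))   # number of distinct base gene IDs
message(n_total, " rows collapse to ", n_genes, " unique genes")
```

A large gap between the two counts is the giveaway that transcript-level rows were never aggregated.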

Screenshot 2023-03-26 at 10 50 29

@bobGSmith can you please confirm which version of the CTD you used in your original EWCE analyses? If it was this version, I will need to redo all of the analyses. Otherwise, our gene counts will be underestimated by ~2/3 and thus the celltype signatures would be degraded.

NathanSkene commented 1 year ago

Yikes, well spotted Brian


bschilder commented 1 year ago

@bobGSmith could really use your input on this. Just need to know which version of the CTD you actually used to run the EWCE analyses.

bschilder commented 1 year ago

After discussing this with @bobGSmith I've confirmed that the issue stems from the script that was used to generate the Descartes (and Tabula Muris) CTDs: https://github.com/neurogenomics/rare_disease_celltyping/blob/dev_descartes/source/CTD_Gen.R
And the script where the function was run: https://github.com/neurogenomics/rare_disease_celltyping/blob/dev_descartes/Makefile.R

@bobGSmith was given a version of this script by @NathanSkene, as it had previously been written by Zijing. I'm not sure why this was suggested instead of the built-in EWCE::generate_ctd function (which should have been available at the time), but the script doesn't account for a lot of the variability in scRNA-seq dataset formats. In the case of Descartes, one consequence of this is that the genes were not named correctly. Descartes provides their data at the transcript level, so an extra step must be performed to aggregate the data to the gene level. It seems this was not done, and so duplicated row names were automatically differentiated by R by appending ".1", ".2", ".3", etc. to each duplicated gene name.
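R's silent suffixing behaviour can be reproduced in miniature; `make.unique()` is what underlies the ".1"/".2" differentiation of duplicated names (toy gene names here, not from the real dataset):

```r
# Duplicated transcript-level "gene" names...
genes <- c("GAPDH", "GAPDH", "GAPDH", "ACTB")
# ...get silently differentiated, e.g. when coerced to unique row names:
deduped <- make.unique(genes)
print(deduped)
# "GAPDH" "GAPDH.1" "GAPDH.2" "ACTB"
```

The suffixed entries then look like distinct genes to any downstream analysis.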

In conclusion, I will indeed need to reprocess both CTDs and rerun all analyses for this project.

bschilder commented 1 year ago

Assessing the issue

Okay, so I've been going back and trying to reconstruct what happened. If I remember correctly, @bobGSmith used the script provided by @NathanSkene because the merged Descartes dataset was too large for EWCE to process all at once. This was when I was still working on the DelayedArray implementation of EWCE, which allows it to scale to arbitrarily large datasets. So the strategy @bobGSmith took was to chunk each tissue-specific file from Descartes, aggregate them to the cell-type level separately, and then merge them back together into one CTD.
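That chunk-then-merge strategy can be sketched as below. This is a toy reconstruction, not the original script: the in-memory matrices, gene names, and the `cell_type` attribute are all hypothetical stand-ins for the per-tissue Descartes files, and `rowsum()` stands in for whatever aggregation was actually used.

```r
# Toy per-tissue matrices standing in for the tissue-specific Descartes files.
set.seed(1)
make_tissue <- function(n_cells) {
  mat <- matrix(rpois(3 * n_cells, 5), nrow = 3,
                dimnames = list(c("GAPDH", "ACTB", "TP53"), NULL))
  attr(mat, "cell_type") <- sample(c("neuron", "glia"), n_cells, replace = TRUE)
  mat
}
tissues <- list(lung = make_tissue(10), heart = make_tissue(8))

# 1) Aggregate each tissue to cell-type level separately:
aggregate_tissue <- function(mat) {
  t(rowsum(t(mat), group = attr(mat, "cell_type")))  # genes x celltypes
}
per_tissue <- lapply(tissues, aggregate_tissue)

# 2) Merge back together on the shared gene set:
shared <- Reduce(intersect, lapply(per_tissue, rownames))
merged <- do.call(cbind, lapply(per_tissue, function(m) m[shared, , drop = FALSE]))
dim(merged)
```

Note that merging on the intersection of row names is exactly where mis-suffixed gene IDs would cause silent gene loss or duplication.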

Another complication arose due to the way the genes in Descartes are annotated. As I mentioned, they appear to be annotated at the level of transcripts. Using the Gene Annotations file provided on their website, we can inspect this: https://descartes.brotmanbaty.org/bbi/human-gene-expression-during-development/

genes <- readRDS(url("https://atlas.fredhutch.org/data/bbi/descartes/human_gtex/downloads/data_summarize_fetus_data/df_gene.RDS"))
dim(genes)
# [1] 63561     3

Screenshot 2023-04-12 at 13 13 24

The first thing I noticed was the ".#" suffix at the end of each Ensembl gene ID. This usually indicates that the label is referring to a specific transcript of that gene. However, this suffix is usually given at the end of an Ensembl transcript ID (ENST), not a gene ID (ENSG), which makes this annotation style quite confusing. For example, see the Transcript Table here: https://grch37.ensembl.org/Homo_sapiens/Gene/TranscriptComparison?g=ENSG00000223972;r=1:11869-14412

Googling led me to this page from Illumina where they use the ENSG.# format: https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/TPipelineGAFile_fDG.htm

The second thing I noticed was that there seemed to be a LOT of pseudogenes in Descartes. This may be related to the odd naming conventions I described.

table(genes$gene_type) |> sort(decreasing = TRUE)

Screenshot 2023-04-12 at 13 22 02

Now, part of me is inclined to remove all these pseudogenes. However, "pseudogenes" are a bit of an open question in biology. They can indicate genes which have no function, or genes that have functions that we simply do not understand presently. Therefore, throwing them all out would make me a bit nervous, especially when they account for such a large proportion of the dataset.
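For reference, flagging (rather than dropping) the pseudogenes would keep that option open; a minimal sketch on a toy version of the annotation table (the real table is the `genes` data.frame loaded above, which also has a `gene_type` column):

```r
# Toy stand-in for the Descartes gene annotation table:
genes_toy <- data.frame(
  gene_id   = c("ENSG01.1", "ENSG02.1", "ENSG03.1", "ENSG04.1"),
  gene_type = c("protein_coding", "processed_pseudogene",
                "unprocessed_pseudogene", "lincRNA")
)
# Flag anything whose gene_type mentions "pseudogene":
genes_toy$is_pseudo <- grepl("pseudogene", genes_toy$gene_type)
table(genes_toy$is_pseudo)
```

Filtering would then be a one-line subset, but the flag lets us defer that decision.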

Next, we can see that the number of unique ENSG IDs is only slightly larger than the total number of unique gene symbols provided. This probably has a lot to do with the large number of pseudogenes (as well as non-coding transcripts) included in this dataset. But even so, this seems like a remarkably large number of unique gene symbols. This makes me concerned that the gene symbol naming conventions used by the Descartes team may not match up with the ones commonly used in other datasets:

lapply(genes,function(x){length(unique(x))})
$gene_id
[1] 63561

$gene_type
[1] 30

$gene_short_name
[1] 56622

So @bobGSmith, your 59,608 gene symbols fall somewhere between the number of unique ENSG IDs and the number of unique gene symbols in Descartes. I still suspect there may be artificially-induced ".#" suffixes on the gene symbols in your CTD, given that the scripts weren't really designed for this scenario, but the consequences may not be quite as drastic as the ~2/3 gene signature degradation I originally estimated. Still, we should systematically address this to make sure we're processing the data correctly.

How do we address this?

To explore this further, I tried doing some mapping with orthogene. This will give us a sense of how many of these Descartes-provided gene symbols exist in other databases:

gp <- orthogene::all_genes(species="human",method = "gprofiler")
hg <- orthogene::all_genes(species="human",method = "homologene")
bg <- orthogene::all_genes(species="human",method = "babelgene")

Screenshot 2023-04-12 at 13 31 01

We can see there's a big discrepancy in the total number of genes available in each of these databases. This is likely because homologene and babelgene are focused exclusively on converting gene symbols across species, while gprofiler converts gene symbols/Ensembl IDs/transcripts/proteins/etc. within and across species.

So it would seem that for the Descartes use case, gprofiler would be most appropriate. But we are still left with the question of which gene annotation to use:

  1. their ENSG.# notation.
  2. their ENSG.# notation, aggregated after removing ".#" bit.
  3. their ENSG.# notation, aggregated after removing ".#" bit AND standardising gene IDs using gprofiler.
  4. aggregating the expression matrices (sum) using their gene symbol notation.
  5. aggregating the expression matrices (sum) using their gene symbol notation AND standardising gene symbols using gprofiler.

Let's try some of these and see what we get.
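Options 2 and 3 both hinge on the same aggregation step: strip the ".#" suffix, then sum expression across rows that collapse to the same gene. A minimal version of that step on a toy matrix (real code would operate on the full expression matrix):

```r
# Toy transcript-level matrix with versioned/duplicated ENSG-style IDs:
mat <- matrix(1:8, nrow = 4,
              dimnames = list(c("ENSG01.1", "ENSG01.2", "ENSG02.1", "ENSG03.1"),
                              c("celltype_a", "celltype_b")))
# Strip the ".#" suffix, then sum expression across rows sharing a gene:
gene_ids <- sub("\\.\\d+$", "", rownames(mat))
gene_mat <- rowsum(mat, group = gene_ids)
print(gene_mat)
```

`rowsum()` returns one row per unique gene ID, with the transcript-level counts summed within each column.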

bschilder commented 1 year ago

When processing this (or any) scRNAseq dataset, we want to consider two things:

  1. Fidelity: after processing the data, how much of the original information do we retain?
  2. Comparability: How comparable are the features in this dataset to other datasets (e.g. other scRNAseq from the same or other species, as well as MAGMA gene annotation files, or HPO gene sets)?

These factors can sometimes be a tradeoff, so we'll want to aim for a final format which is best for the majority of our use cases.

One thing we know for sure is that we need to get everything to human gene symbols as all our downstream analyses depend on that format.

1. aggregate using their provided gene symbols

This would give high fidelity, but questionable comparability, as I'm a bit suspicious of their gene annotation practices in general.

2. map their ENSG.# notation to new gene symbols and aggregate

gene_map1 <- orthogene::map_genes(genes$gene_id,species = "human", mthreshold = 1)

Screenshot 2023-04-12 at 14 03 22

This yields 0 results, despite the fact that gprofiler (the API map_genes is using under the hood) is super comprehensive with its gene mappings, highlighting my earlier concerns about their choice of naming conventions. That is, unless I'm querying the wrong namespace in gprofiler.

3. map their ENSG.# notation after removing ".#" bit and aggregate

genes$ENSG <- stringr::str_split(genes$gene_id, "\\.", simplify = TRUE)[,1]
gene_map2 <- orthogene::map_genes(genes$ENSG,species = "human", mthreshold = 1)

Screenshot 2023-04-12 at 14 09 37

This approach is able to identify mappings for ~39k of these genes.

4. aggregate using their gene symbols

This would give 56,622 genes, as shown in my previous comment.

5. map their gene symbols to standardised gene symbols

gene_map3 <- orthogene::map_genes(genes$gene_short_name,species = "human", mthreshold = 1)

Screenshot 2023-04-12 at 14 14 39

This approach identifies somewhat fewer genes (~36k) than the ENSG approach.

bschilder commented 1 year ago

Now that we've done this, we can check to see how many of these genes from each approach are in the HPO annotations, which is ultimately what we care about.

The HPO annotations have 4555 unique gene symbols in total.

gene_lists <- list(genes$gene_short_name, gene_map2$name, gene_map3$name)
d <- HPOExplorer::load_phenotype_to_genes()
lapply(gene_lists, function(g){sum(unique(d$Gene) %in% g)})

Screenshot 2023-04-12 at 14 26 09

From this, we can see that gene_map2 ("map their ENSG.# notation after removing ".#" bit and aggregate") gives us the best coverage of HPO genes. Using the original author-provided gene symbols gives us the worst coverage.

So I've decided that this is the best method for us to move forward with, and have reprocessed the CTD accordingly. Here are the final dimensions of the new DescartesHuman CTD (after dropping uninformative genes):

Screenshot 2023-04-12 at 14 33 51

Note that the main 77 celltypes are now stored in level 2 (previously level 1). This is because I've added another level at the level of whole-organs (which is essentially pseudobulk).

bschilder commented 1 year ago

Just to check how much this will affect our analyses we can compute some metrics for both the old and new CTDs.

We can compute the proportion of each phenotype's gene list that are present in each CTD.

Old CTD

ctd <- MultiEWCE::load_example_ctd()
ctd_genes <- rownames(ctd[[1]]$mean_exp)
pheno_counts <- d[,list(n_genes=length(unique(Gene[Gene %in% ctd_genes])),
                        total_genes=length(unique(Gene))),
                  by="HPO_ID"][,prop_genes:=n_genes/total_genes]
hist(pheno_counts$prop_genes)

Screenshot 2023-04-12 at 15 17 08 Screenshot 2023-04-12 at 15 17 34

New CTD

ctd_genes <- rownames(ctd2[[2]]$mean_exp)
pheno_counts <- d[,list(n_genes=length(unique(Gene[Gene %in% ctd_genes])),
                        total_genes=length(unique(Gene))),
                  by="HPO_ID"][,prop_genes:=n_genes/total_genes]
hist(pheno_counts$prop_genes)

Screenshot 2023-04-12 at 15 48 41 Screenshot 2023-04-12 at 15 48 29

We can see the new CTD increases the gene coverage to >98%, despite having fewer total genes than the old CTD (38259 vs. 59608). This is because the genes that remain in the new CTD have more generalisable gene symbols.