hbctraining / DGE_workshop_salmon

https://hbctraining.github.io/DGE_workshop_salmon/

changing to AnnotationHub #25

Closed mistrm82 closed 5 years ago

mistrm82 commented 5 years ago

The problem we encountered when trying to switch to AnnotationHub is that Ensembl-to-Entrez mappings are one-to-many, so the entrezid column is stored as a list. Here is some code that will work if we choose to switch. If we do switch, it would be worth exploring the differences between these annotations and Ensv86 using AnnotationDbi.

library(AnnotationHub)
library(ensembldb)
library(dplyr)  # provides %>%
library(purrr)  # provides map()

# Connect to AnnotationHub
ah <- AnnotationHub()

# Query AnnotationHub
human_ens <- query(ah, c("Homo sapiens", "EnsDb"))

# Extract annotations of interest
human_ens <- human_ens[["AH64923"]]

# Extract gene-level information
genes(human_ens, return.type = "data.frame") %>% View()

# Create a gene-level dataframe (FOR LESSON)
annotations_ahb <- genes(human_ens, return.type = "data.frame") %>%
  dplyr::select(gene_id, symbol, entrezid, gene_biotype) %>%
  dplyr::filter(gene_id %in% res_tableOE_tb$gene)

# Wait a second, we don't have one-to-one mappings!
class(annotations_ahb$entrezid)
which(lengths(annotations_ahb$entrezid) > 1)

# So which one is right? And why do we have this problem?

# Okay, let's just keep the first Entrez ID in cases where there are multiple mappings
annotations_ahb$entrezid <- map(annotations_ahb$entrezid, 1) %>% unlist()

# Determine the indices of the first occurrence of each gene symbol
non_duplicates_idx <- which(!duplicated(annotations_ahb$symbol))

# Keep only the non-duplicated genes using those indices
annotations_ahb <- annotations_ahb[non_duplicates_idx, ]

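
For comparison, here is a minimal sketch of the two ways we could handle the list column, run on a toy data frame instead of the real genes() output. The gene IDs, symbols and Entrez IDs below are made up for illustration, and the object names (toy, first_only, long) are hypothetical. It only uses base R, so it does not depend on AnnotationHub being available.

```r
# Toy stand-in for the genes() output; all IDs are made up for illustration.
toy <- data.frame(
  gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  symbol  = c("GENEA", "GENEB", "GENEC"),
  stringsAsFactors = FALSE
)

# entrezid is a list column: one integer vector per gene
toy$entrezid <- list(101L, c(202L, 203L), NA_integer_)

# Which genes map to more than one Entrez ID?
multi_idx <- which(lengths(toy$entrezid) > 1)

# Option 1: keep only the first Entrez ID per gene
first_only <- toy
first_only$entrezid <- vapply(toy$entrezid, function(x) x[1], integer(1))

# Option 2: keep every mapping, one row per (gene, Entrez ID) pair
n_ids <- lengths(toy$entrezid)
long <- data.frame(
  gene_id  = rep(toy$gene_id, n_ids),
  symbol   = rep(toy$symbol, n_ids),
  entrezid = unlist(toy$entrezid),
  stringsAsFactors = FALSE
)
```

Option 1 keeps one row per gene but silently drops the extra Entrez IDs; option 2 keeps all mappings at the cost of duplicated gene rows, which may or may not suit the downstream lesson.
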
mistrm82 commented 5 years ago

Re: Multiple Entrez IDs mapping to a single Ensembl ID

“As for differences between Ensembl and EntrezGene, it was already mentioned in this thread that the CCDS set was constructed to come up with a more unified gene set. Ensembl, UCSC, NCBI and Havana all take part in forming and agreeing on the consensus coding sequences in this set, which currently exists for human and mouse. The latest update, in Sept 2011, shows there are 26,473 CCDS IDs in Human corresponding to 18,471 gene IDs. (CCDS can be splice variants of one gene; ie more than one CCDS can be assigned to a gene).

As for matches between Ensembl and EntrezGene, we know that for the human Ensembl gene set, we have 21,184 links to EntrezGene. We try to get a perfect match when possible. Out of these 21,184 links, 504 genes have more than one EntrezGene entry associated with them. This occurs when we cannot choose a perfect match; ie when we have two good matches, but one does not appear to match with a better percentage than the other. In that case, we assign both matches to the gene/transcript.”

Source: https://www.biostars.org/p/16505/
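
To put the quoted numbers in perspective, here is a quick calculation of the share of linked genes with more than one EntrezGene entry. The counts are taken from the quote above (as of 2011); the variable names are just for illustration.

```r
# Counts taken from the Biostars quote above
links_total <- 21184  # Ensembl genes linked to EntrezGene
links_multi <- 504    # genes with more than one EntrezGene entry

# Percentage of linked genes with multiple EntrezGene entries
pct_multi <- 100 * links_multi / links_total
round(pct_multi, 1)
```

So multiple mappings affected only a small percentage of linked genes in that release.
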

mistrm82 commented 5 years ago

Changed: a data frame is now created for ahb; we can choose whether or not to use it for FA.