A simple diff should reveal the changes I made, but if you have questions please let me know.
Best,
Leonardo
Un-evaluated code
library('GenomicFeatures')
library('rtracklayer')
library('bumphunter')
library('org.Hs.eg.db')
library('devtools')
## Get the raw data
gr <- import('ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz')
## Subset and add the chromosome length info
gr_small <- keepSeqlevels(gr, c('Y', 'MT'), pruning.mode = 'tidy')
hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38")
new_info <- hg38_chrominfo$UCSC_seqlength[match(seqlevels(gr_small),
mapSeqlevels(hg38_chrominfo$UCSC_seqlevel, 'Ensembl'))]
## Set the chromosome lengths, genome
seqlengths(gr_small) <- new_info
genome(gr_small) <- 'hg38'
## Create the TxDb object for gencode v25 (just chrs Y and MT)
txdb <- makeTxDbFromGRanges(gr_small)
## Run annotate transcripts
ann <- annotateTranscripts(txdb, annotationPackage = 'org.Hs.eg.db')
## annotateTranscripts() assumes that the ids are from ENTREZ
ann
ann$Entrez
ann$Gene
ann$Refseq
## Current code:
# https://github.com/Bioconductor-mirror/bumphunter/blob/515c9c45a6a1b1fa03d1936fb946d8fbea184710/R/matchGenes.R#L182-L202
## Use new code with default arguments
## Code at https://gist.github.com/lcolladotor/a614eadf1c85ee73c0a6be9c5b83754a
ann2 <- annotateTranscripts_fix(txdb, annotationPackage = 'org.Hs.eg.db')
ann2
## The only change is the name of the column with the gene ids.
identical(ann$Entrez, ann2$Geneid)
identical(ann$Gene, ann2$Gene)
identical(ann$Refseq, ann2$Refseq)
## Now specify the options to be used in the mapIds()
ann3 <- annotateTranscripts_fix(txdb, annotationPackage = 'org.Hs.eg.db', mappingInfo = list('column' = 'ENTREZID', 'keytype' = 'ENSEMBL', 'multiVals' = 'first'))
ann3
identical(ann$Entrez, ann3$Geneid)
## Gene symbols are there now
ann3$Gene
## Refseq section is complicated though
ann3$Refseq[1]
## Use the new version of matchGenes() that uses 'Geneid' instead of 'Entrez'
## Code at https://gist.github.com/lcolladotor/a614eadf1c85ee73c0a6be9c5b83754a
genes <- matchGenes_fix(ann3[1:6], ann3)
genes
## Reproducibility information
Sys.time()
options(width = 120)
session_info()
Hi,
As you can see at https://support.bioconductor.org/p/93235/#93397, we have some users interested in using TxDb objects created with Gencode v25 data instead of the UCSC ones that exist in Bioconductor. As they are written right now,
matchGenes()
andannotateTranscripts()
assume that the gene ids are ENTREZIDs. A couple changes make these functions more flexible. I'm posting the changes in a gist https://gist.github.com/lcolladotor/a614eadf1c85ee73c0a6be9c5b83754a instead of a pull request since you might prefer to use the...
argument or do other things. The main change is https://gist.github.com/lcolladotor/a614eadf1c85ee73c0a6be9c5b83754a#file-annotatetranscripts_fix-r-L94-L97 where I leave the code pretty general since I know that these functions can be used with multiple annotation packages.A simple
diff
should reveal the changes I made, but if you have questions please let me know.Best, Leonardo
Un-evaluated code
Evaluated code