Closed Ci-TJ closed 4 years ago
Differences between genome releases/genome build names can be a problem indeed. But the problem in your code is that you are using UCSC-style chromosome names ("chr1"
), while ensembldb
expects chromosome names in Ensembl-style ("1"
). Or better said, EnsDb
databases contain annotations provided by Ensembl and thus chromosome names without the "chr"
are used.
I would suggest that you follow the examples in the ensembldb vignette describing how to extract transcript sequences (here):
library(EnsDb.Mmusculus.v79)
edb <- EnsDb.Mmusculus.v79
## Get the genomic sequence of the genome build matching the
## one from edb as good as possible
dna <- getGenomeTwoBitFile(edb)
## Use Ensembl-style chromosome names, such as Y instead of chrY
y_tx <- exonsBy(edb, filter = ~ seq_name == "Y")
y_tx
GRangesList object of length 2243:
$ENSMUST00000055032
GRanges object with 26 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_rank
<Rle> <IRanges> <Rle> | <character> <integer>
[1] Y 897790-898115 + | ENSMUSE00001319692 1
[2] Y 898550-898627 + | ENSMUSE00000568662 2
[3] Y 899428-899550 + | ENSMUSE00000568661 3
[4] Y 899787-899957 + | ENSMUSE00000568660 4
[5] Y 900479-900613 + | ENSMUSE00000568659 5
... ... ... ... . ... ...
[22] Y 940905-941042 + | ENSMUSE00000568628 22
[23] Y 941227-941823 + | ENSMUSE00000568627 23
[24] Y 942231-942312 + | ENSMUSE00000568626 24
[25] Y 942442-942641 + | ENSMUSE00000568625 25
[26] Y 942824-946316 + | ENSMUSE00001335494 26
-------
seqinfo: 1 sequence from GRCm38 genome
...
<2242 more elements>
y_seqs <- extractTranscriptSeqs(dna, y_tx)
y_seqs
A DNAStringSet instance of length 2243
width seq names
[1] 7974 ATTGTGATCTAAAAATGGTGGA...ATAAACCCTTTGCTCCCCAAA ENSMUST00000055032
[2] 2783 GAACAGAAAGACTGGTGAACAT...TGAGAAATAAATTTTAACAAC ENSMUST00000065545
[3] 5208 TGCGGTTCACACTGACGGTATT...TTTATTATTAGTAAATCTTAA ENSMUST00000069309
[4] 338 ATGTCCACTATCCTGAACCTCA...CAGCTGATGGTTCATGGGTTT ENSMUST00000084813
[5] 1188 ATGGAGGGCCATGTCAAGCGCC...CTGTGGTTGGCAGTCTCATGA ENSMUST00000091178
... ... ...
[2239] 2169 GTCAAACTCCCCTGCCCTGCTA...TATCCTGCCACCAAAGCTATG ENSMUST00000195837
[2240] 2978 GACTTTCTCCTAAGAAATAAAT...ACCTGGAGTTTACTGGGAAGT ENSMUST00000195864
[2241] 431 GGCCTTGCTCTTATGTCATCTG...ATCATCTTTAGGTCACTCATC ENSMUST00000195871
[2242] 621 GGCCTTGCTCTTATGTCATCTG...ATCTTTATGTCACTCATCACC ENSMUST00000195874
[2243] 1499 ATTCTCTATCTGATGGAGGGCA...ATATAAATTTTATGTGTAAAA ENSMUST00000195885
Differences between genome releases/genome build names can be a problem indeed. But the problem in your code is that you are using UCSC-style chromosome names (
"chr1"
), whileensembldb
expects chromosome names in Ensembl-style ("1"
). Or better said,EnsDb
databases contain annotations provided by Ensembl and thus chromosome names without the"chr"
are used.I would suggest that you follow the examples in the ensembldb vignette describing how to extract transcript sequences (here):
library(EnsDb.Mmusculus.v79) edb <- EnsDb.Mmusculus.v79 ## Get the genomic sequence of the genome build matching the ## one from edb as good as possible dna <- getGenomeTwoBitFile(edb) ## Use Ensembl-style chromosome names, such as Y instead of chrY y_tx <- exonsBy(edb, filter = ~ seq_name == "Y") y_tx GRangesList object of length 2243: $ENSMUST00000055032 GRanges object with 26 ranges and 2 metadata columns: seqnames ranges strand | exon_id exon_rank <Rle> <IRanges> <Rle> | <character> <integer> [1] Y 897790-898115 + | ENSMUSE00001319692 1 [2] Y 898550-898627 + | ENSMUSE00000568662 2 [3] Y 899428-899550 + | ENSMUSE00000568661 3 [4] Y 899787-899957 + | ENSMUSE00000568660 4 [5] Y 900479-900613 + | ENSMUSE00000568659 5 ... ... ... ... . ... ... [22] Y 940905-941042 + | ENSMUSE00000568628 22 [23] Y 941227-941823 + | ENSMUSE00000568627 23 [24] Y 942231-942312 + | ENSMUSE00000568626 24 [25] Y 942442-942641 + | ENSMUSE00000568625 25 [26] Y 942824-946316 + | ENSMUSE00001335494 26 ------- seqinfo: 1 sequence from GRCm38 genome ... <2242 more elements> y_seqs <- extractTranscriptSeqs(dna, y_tx) y_seqs A DNAStringSet instance of length 2243 width seq names [1] 7974 ATTGTGATCTAAAAATGGTGGA...ATAAACCCTTTGCTCCCCAAA ENSMUST00000055032 [2] 2783 GAACAGAAAGACTGGTGAACAT...TGAGAAATAAATTTTAACAAC ENSMUST00000065545 [3] 5208 TGCGGTTCACACTGACGGTATT...TTTATTATTAGTAAATCTTAA ENSMUST00000069309 [4] 338 ATGTCCACTATCCTGAACCTCA...CAGCTGATGGTTCATGGGTTT ENSMUST00000084813 [5] 1188 ATGGAGGGCCATGTCAAGCGCC...CTGTGGTTGGCAGTCTCATGA ENSMUST00000091178 ... ... ... [2239] 2169 GTCAAACTCCCCTGCCCTGCTA...TATCCTGCCACCAAAGCTATG ENSMUST00000195837 [2240] 2978 GACTTTCTCCTAAGAAATAAAT...ACCTGGAGTTTACTGGGAAGT ENSMUST00000195864 [2241] 431 GGCCTTGCTCTTATGTCATCTG...ATCATCTTTAGGTCACTCATC ENSMUST00000195871 [2242] 621 GGCCTTGCTCTTATGTCATCTG...ATCTTTATGTCACTCATCACC ENSMUST00000195874 [2243] 1499 ATTCTCTATCTGATGGAGGGCA...ATATAAATTTTATGTGTAAAA ENSMUST00000195885
Hi, Thank you for your reply, but I suffer problems on the way:
library(ensembldb) library(EnsDb.Mmusculus.v79) edbMMU<- EnsDb.Mmusculus.v79 dna<- getGenomeTwoBitFile(edbMMU) No internet connection using 'localHub=TRUE' Error in .updateHubDB(hub_bfc, .class, url, proxy, localHub) : Invalid Cache: sqlite file Hub has not been added to cache Run again with 'localHub=FALSE'
then,
library(AnnotationHub) ah <- AnnotationHub() No internet connection using 'localHub=TRUE' Error in .updateHubDB(hub_bfc, .class, url, proxy, localHub) : Invalid Cache: sqlite file Hub has not been added to cache Run again with 'localHub=FALSE'
Is anything wrong with this?
@jorainer Hi, Thank you for your reply, but I suffer problems on the way:
library(ensembldb) library(EnsDb.Mmusculus.v79) edbMMU<- EnsDb.Mmusculus.v79 dna<- getGenomeTwoBitFile(edbMMU) No internet connection using 'localHub=TRUE' Error in .updateHubDB(hub_bfc, .class, url, proxy, localHub) : Invalid Cache: sqlite file Hub has not been added to cache Run again with 'localHub=FALSE'
then,
library(AnnotationHub) ah <- AnnotationHub() No internet connection using 'localHub=TRUE' Error in .updateHubDB(hub_bfc, .class, url, proxy, localHub) : Invalid Cache: sqlite file Hub has not been added to cache Run again with 'localHub=FALSE'
Is anything wrong with this?
Seems that you have no internet connection from the computer running R?
The dna<- getGenomeTwoBitFile(edbMMU)
will try to retrieve the genomic sequence from the internet, more specifically from the AnnotationHub
from Bioconductor.
@jorainer
I don't understand that internet connection from the PC running R, which maybe I don't skill in R, but I could install the packages by install.packages()
. maybe, I need to read the AnnotationHub manual.
Thanks!
@jorainer Sorry for disturbing you again, I tried to annotate the biotype for the transcript that could be found in Ensembl, but I found many was missing.
library(ensembldb) library(EnsDb.Mmusculus.v79) MMUedb<- EnsDb.Mmusculus.v79 nrow(sigDiffT_GSE66630) [1] 456 BioType_sigDiffT_GSE66630<- select(MMUedb, keys = as.character(sigDiffT_GSE66630[,"transcriptNames"]), keytype = "TXID",columns = c("GENEID","TXID","TXBIOTYPE")) nrow(BioType_sigDiffT_GSE66630) [1] 383 head(BioType_sigDiffT_GSE66630) GENEID TXID TXBIOTYPE 1 ENSMUSG00000020676 ENSMUST00000000342 protein_coding 2 ENSMUSG00000000605 ENSMUST00000000619 protein_coding 3 ENSMUSG00000000753 ENSMUST00000000769 protein_coding 4 ENSMUSG00000001025 ENSMUST00000001051 protein_coding 5 ENSMUSG00000001248 ENSMUST00000001280 protein_coding 6 ENSMUSG00000001518 ENSMUST00000001559 protein_coding select(MMUedb, keys = as.character("ENSMUST00000000619"), keytype = "TXID",columns = c("GENEID","TXID","TXBIOTYPE")) GENEID TXID TXBIOTYPE 1 ENSMUSG00000000605 ENSMUST00000000619 protein_coding select(MMUedb, keys = as.character("ENSMUST00000202681"), keytype = "TXID",columns = c("GENEID","TXID","TXBIOTYPE")) [1] GENEID TXID TXBIOTYPE <0 行> (或0-长度的row.names) select(MMUedb, keys = as.character("ENSMUST00000209350"), keytype = "TXID",columns = c("GENEID","TXID","TXBIOTYPE")) [1] GENEID TXID TXBIOTYPE <0 行> (或0-长度的row.names)
Did it occur owing to the "EnsDb.Mmusculus.v79", or a bug ?
Ci
Eventually some of the transcript IDs are not present in the EnsDb
(maybe because the data set used a more recent Ensembl release than 79 on which the EnsDb
you used is based on)?
To check you could try to do the following:
edb <- EnsDb.Mmusculus.v79
txs <- transcripts(edb, return.type = "data.frame")
wantIds <- sigDiffT_GSE66630[, "transcriptName"]
missingIds <- wantIds[!(wantIds %in% txs$tx_id)]
length(missingIds)
This would tell you if, and how many, of your IDs in the sigDiffT...
object are not available in the EnsDb
database.
@jorainer I sorry for forgetting the version of the Ensemble genome I used. I aligned the dataset based on the Ensembl release-98, and this may be the reason. I found the ensembldb is great tool, and I will continue learning on it. Thanks you for your help during this time!
Note that you can get EnsDb
for more recent Ensembl versions from AnnotationHub
- that's also the preferred way to install/use EnsDb
databases (instead of the EnsDb
packages):
> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2019-05-02
> query(ah, "EnsDb")
AnnotationHub with 1321 records
# snapshotDate(): 2019-05-02
# $dataprovider: Ensembl
# $species: Homo sapiens, Ailuropoda melanoleuca, Anolis carolinensis, Astya...
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53185"]]'
title
AH53185 | Ensembl 87 EnsDb for Anolis Carolinensis
AH53186 | Ensembl 87 EnsDb for Ailuropoda Melanoleuca
... ...
AH75116 | Ensembl 98 EnsDb for Xenopus tropicalis
AH75117 | Ensembl 98 EnsDb for Zonotrichia albicollis
## To get the EnsDb for Mus musculus and Ensembl release 98:
> query(ah, "EnsDb.Mmusculus.v98")
AnnotationHub with 1 record
# snapshotDate(): 2019-05-02
# names(): AH75036
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# $rdatadateadded: 2019-05-02
# $title: Ensembl 98 EnsDb for Mus musculus
# $description: Gene and protein annotations for Mus musculus based on Ensem...
# $taxonomyid: 10090
# $genome: GRCm38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("98", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
# "Protein", "Transcript")
# retrieve record with 'object[["AH75036"]]'
> edb <- ah[["AH75036"]]
Thank you very much from my deep heart. your suggetions are very helpful for me. I just fixed the problem of AnnotationHub, and I actualized your suggesting commands on my computer. Best.
Hi, I want to get transcript sequences by using ensembldb, but it not available for mouse transcripts! how should I deal with it?
**> library(EnsDb.Mmusculus.v79)
Best, Ci