jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb
33 stars 10 forks source link

getting the mouse transcript sequence by ensembldb #102

Closed Ci-TJ closed 4 years ago

Ci-TJ commented 5 years ago

Hi, I want to get transcript sequences by using ensembldb, but it not available for mouse transcripts! how should I deal with it?

**> library(EnsDb.Mmusculus.v79)

edbMMU<- EnsDb.Mmusculus.v79 library(BSgenome.Mmusculus.UCSC.mm10) bsgMMU<- BSgenome.Mmusculus.UCSC.mm10 unique(genome(edbMMU)) [1] "GRCm38" unique(genome(bsgMMU)) [1] "mm10" yTxSeqs <- extractTranscriptSeqs(bsgMMU, exonsBy(edbMMU, "tx", filter = SeqNameFilter("chr1"))) head(yTxSeqs) A DNAStringSet instance of length 0 **

  Is the genome version causing this error? or is not available for achieving it by using ensembldb?

Best, Ci

jorainer commented 5 years ago

Differences between genome releases/genome build names can be a problem indeed. But the problem in your code is that you are using UCSC-style chromosome names ("chr1"), while ensembldb expects chromosome names in Ensembl-style ("1"). Or better said, EnsDb databases contain annotations provided by Ensembl and thus chromosome names without the "chr" are used.

I would suggest that you follow the examples in the ensembldb vignette describing how to extract transcript sequences (here):

library(EnsDb.Mmusculus.v79)
edb <- EnsDb.Mmusculus.v79

## Get the genomic sequence of the genome build matching the
## one from edb as good as possible
dna <- getGenomeTwoBitFile(edb)

## Use Ensembl-style chromosome names, such as Y instead of chrY
y_tx <- exonsBy(edb, filter = ~ seq_name == "Y")
y_tx
GRangesList object of length 2243:
$ENSMUST00000055032
GRanges object with 26 ranges and 2 metadata columns:
       seqnames        ranges strand |            exon_id exon_rank
          <Rle>     <IRanges>  <Rle> |        <character> <integer>
   [1]        Y 897790-898115      + | ENSMUSE00001319692         1
   [2]        Y 898550-898627      + | ENSMUSE00000568662         2
   [3]        Y 899428-899550      + | ENSMUSE00000568661         3
   [4]        Y 899787-899957      + | ENSMUSE00000568660         4
   [5]        Y 900479-900613      + | ENSMUSE00000568659         5
   ...      ...           ...    ... .                ...       ...
  [22]        Y 940905-941042      + | ENSMUSE00000568628        22
  [23]        Y 941227-941823      + | ENSMUSE00000568627        23
  [24]        Y 942231-942312      + | ENSMUSE00000568626        24
  [25]        Y 942442-942641      + | ENSMUSE00000568625        25
  [26]        Y 942824-946316      + | ENSMUSE00001335494        26
  -------
  seqinfo: 1 sequence from GRCm38 genome

...
<2242 more elements>

y_seqs <- extractTranscriptSeqs(dna, y_tx)
y_seqs
  A DNAStringSet instance of length 2243
       width seq                                            names               
   [1]  7974 ATTGTGATCTAAAAATGGTGGA...ATAAACCCTTTGCTCCCCAAA ENSMUST00000055032
   [2]  2783 GAACAGAAAGACTGGTGAACAT...TGAGAAATAAATTTTAACAAC ENSMUST00000065545
   [3]  5208 TGCGGTTCACACTGACGGTATT...TTTATTATTAGTAAATCTTAA ENSMUST00000069309
   [4]   338 ATGTCCACTATCCTGAACCTCA...CAGCTGATGGTTCATGGGTTT ENSMUST00000084813
   [5]  1188 ATGGAGGGCCATGTCAAGCGCC...CTGTGGTTGGCAGTCTCATGA ENSMUST00000091178
   ...   ... ...
[2239]  2169 GTCAAACTCCCCTGCCCTGCTA...TATCCTGCCACCAAAGCTATG ENSMUST00000195837
[2240]  2978 GACTTTCTCCTAAGAAATAAAT...ACCTGGAGTTTACTGGGAAGT ENSMUST00000195864
[2241]   431 GGCCTTGCTCTTATGTCATCTG...ATCATCTTTAGGTCACTCATC ENSMUST00000195871
[2242]   621 GGCCTTGCTCTTATGTCATCTG...ATCTTTATGTCACTCATCACC ENSMUST00000195874
[2243]  1499 ATTCTCTATCTGATGGAGGGCA...ATATAAATTTTATGTGTAAAA ENSMUST00000195885
Ci-TJ commented 5 years ago

Differences between genome releases/genome build names can be a problem indeed. But the problem in your code is that you are using UCSC-style chromosome names ("chr1"), while ensembldb expects chromosome names in Ensembl-style ("1"). Or better said, EnsDb databases contain annotations provided by Ensembl and thus chromosome names without the "chr" are used.

I would suggest that you follow the examples in the ensembldb vignette describing how to extract transcript sequences (here):

library(EnsDb.Mmusculus.v79)
edb <- EnsDb.Mmusculus.v79

## Get the genomic sequence of the genome build matching the
## one from edb as good as possible
dna <- getGenomeTwoBitFile(edb)

## Use Ensembl-style chromosome names, such as Y instead of chrY
y_tx <- exonsBy(edb, filter = ~ seq_name == "Y")
y_tx
GRangesList object of length 2243:
$ENSMUST00000055032
GRanges object with 26 ranges and 2 metadata columns:
       seqnames        ranges strand |            exon_id exon_rank
          <Rle>     <IRanges>  <Rle> |        <character> <integer>
   [1]        Y 897790-898115      + | ENSMUSE00001319692         1
   [2]        Y 898550-898627      + | ENSMUSE00000568662         2
   [3]        Y 899428-899550      + | ENSMUSE00000568661         3
   [4]        Y 899787-899957      + | ENSMUSE00000568660         4
   [5]        Y 900479-900613      + | ENSMUSE00000568659         5
   ...      ...           ...    ... .                ...       ...
  [22]        Y 940905-941042      + | ENSMUSE00000568628        22
  [23]        Y 941227-941823      + | ENSMUSE00000568627        23
  [24]        Y 942231-942312      + | ENSMUSE00000568626        24
  [25]        Y 942442-942641      + | ENSMUSE00000568625        25
  [26]        Y 942824-946316      + | ENSMUSE00001335494        26
  -------
  seqinfo: 1 sequence from GRCm38 genome

...
<2242 more elements>

y_seqs <- extractTranscriptSeqs(dna, y_tx)
y_seqs
  A DNAStringSet instance of length 2243
       width seq                                            names               
   [1]  7974 ATTGTGATCTAAAAATGGTGGA...ATAAACCCTTTGCTCCCCAAA ENSMUST00000055032
   [2]  2783 GAACAGAAAGACTGGTGAACAT...TGAGAAATAAATTTTAACAAC ENSMUST00000065545
   [3]  5208 TGCGGTTCACACTGACGGTATT...TTTATTATTAGTAAATCTTAA ENSMUST00000069309
   [4]   338 ATGTCCACTATCCTGAACCTCA...CAGCTGATGGTTCATGGGTTT ENSMUST00000084813
   [5]  1188 ATGGAGGGCCATGTCAAGCGCC...CTGTGGTTGGCAGTCTCATGA ENSMUST00000091178
   ...   ... ...
[2239]  2169 GTCAAACTCCCCTGCCCTGCTA...TATCCTGCCACCAAAGCTATG ENSMUST00000195837
[2240]  2978 GACTTTCTCCTAAGAAATAAAT...ACCTGGAGTTTACTGGGAAGT ENSMUST00000195864
[2241]   431 GGCCTTGCTCTTATGTCATCTG...ATCATCTTTAGGTCACTCATC ENSMUST00000195871
[2242]   621 GGCCTTGCTCTTATGTCATCTG...ATCTTTATGTCACTCATCACC ENSMUST00000195874
[2243]  1499 ATTCTCTATCTGATGGAGGGCA...ATATAAATTTTATGTGTAAAA ENSMUST00000195885

Hi, Thank you for your reply, but I suffer problems on the way:

library(ensembldb) library(EnsDb.Mmusculus.v79) edbMMU<- EnsDb.Mmusculus.v79 dna<- getGenomeTwoBitFile(edbMMU) No internet connection using 'localHub=TRUE' Error in .updateHubDB(hub_bfc, .class, url, proxy, localHub) : Invalid Cache: sqlite file Hub has not been added to cache Run again with 'localHub=FALSE'

then,

library(AnnotationHub) ah <- AnnotationHub() No internet connection using 'localHub=TRUE' Error in .updateHubDB(hub_bfc, .class, url, proxy, localHub) : Invalid Cache: sqlite file Hub has not been added to cache Run again with 'localHub=FALSE'

Is anything wrong with this?

Ci-TJ commented 5 years ago

@jorainer Hi, Thank you for your reply, but I suffer problems on the way:

library(ensembldb) library(EnsDb.Mmusculus.v79) edbMMU<- EnsDb.Mmusculus.v79 dna<- getGenomeTwoBitFile(edbMMU) No internet connection using 'localHub=TRUE' Error in .updateHubDB(hub_bfc, .class, url, proxy, localHub) : Invalid Cache: sqlite file Hub has not been added to cache Run again with 'localHub=FALSE'

then,

library(AnnotationHub) ah <- AnnotationHub() No internet connection using 'localHub=TRUE' Error in .updateHubDB(hub_bfc, .class, url, proxy, localHub) : Invalid Cache: sqlite file Hub has not been added to cache Run again with 'localHub=FALSE'

Is anything wrong with this?

jorainer commented 5 years ago

Seems that you have no internet connection from the computer running R?

The dna<- getGenomeTwoBitFile(edbMMU) will try to retrieve the genomic sequence from the internet, more specifically from the AnnotationHub from Bioconductor.

Ci-TJ commented 5 years ago

@jorainer I don't understand that internet connection from the PC running R, which maybe I don't skill in R, but I could install the packages by install.packages(). maybe, I need to read the AnnotationHub manual. Thanks!

Ci-TJ commented 5 years ago

@jorainer Sorry for disturbing you again, I tried to annotate the biotype for the transcript that could be found in Ensembl, but I found many was missing.

library(ensembldb) library(EnsDb.Mmusculus.v79) MMUedb<- EnsDb.Mmusculus.v79 nrow(sigDiffT_GSE66630) [1] 456 BioType_sigDiffT_GSE66630<- select(MMUedb, keys = as.character(sigDiffT_GSE66630[,"transcriptNames"]), keytype = "TXID",columns = c("GENEID","TXID","TXBIOTYPE")) nrow(BioType_sigDiffT_GSE66630) [1] 383 head(BioType_sigDiffT_GSE66630) GENEID TXID TXBIOTYPE 1 ENSMUSG00000020676 ENSMUST00000000342 protein_coding 2 ENSMUSG00000000605 ENSMUST00000000619 protein_coding 3 ENSMUSG00000000753 ENSMUST00000000769 protein_coding 4 ENSMUSG00000001025 ENSMUST00000001051 protein_coding 5 ENSMUSG00000001248 ENSMUST00000001280 protein_coding 6 ENSMUSG00000001518 ENSMUST00000001559 protein_coding select(MMUedb, keys = as.character("ENSMUST00000000619"), keytype = "TXID",columns = c("GENEID","TXID","TXBIOTYPE")) GENEID TXID TXBIOTYPE 1 ENSMUSG00000000605 ENSMUST00000000619 protein_coding select(MMUedb, keys = as.character("ENSMUST00000202681"), keytype = "TXID",columns = c("GENEID","TXID","TXBIOTYPE")) [1] GENEID TXID TXBIOTYPE <0 行> (或0-长度的row.names) select(MMUedb, keys = as.character("ENSMUST00000209350"), keytype = "TXID",columns = c("GENEID","TXID","TXBIOTYPE")) [1] GENEID TXID TXBIOTYPE <0 行> (或0-长度的row.names)

Did it occur owing to the "EnsDb.Mmusculus.v79", or a bug ?

Ci

jorainer commented 4 years ago

Eventually some of the transcript IDs are not present in the EnsDb (maybe because the data set used a more recent Ensembl release than 79 on which the EnsDb you used is based on)?

To check you could try to do the following:

edb <- EnsDb.Mmusculus.v79
txs <- transcripts(edb, return.type = "data.frame")

wantIds <- sigDiffT_GSE66630[, "transcriptName"]

missingIds <- wantIds[!(wantIds %in% txs$tx_id)]
length(missingIds)

This would tell you if, and how many, of your IDs in the sigDiffT... object are not available in the EnsDb database.

Ci-TJ commented 4 years ago

@jorainer I sorry for forgetting the version of the Ensemble genome I used. I aligned the dataset based on the Ensembl release-98, and this may be the reason. I found the ensembldb is great tool, and I will continue learning on it. Thanks you for your help during this time!

jorainer commented 4 years ago

Note that you can get EnsDb for more recent Ensembl versions from AnnotationHub - that's also the preferred way to install/use EnsDb databases (instead of the EnsDb packages):

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2019-05-02
> query(ah, "EnsDb")
AnnotationHub with 1321 records
# snapshotDate(): 2019-05-02 
# $dataprovider: Ensembl
# $species: Homo sapiens, Ailuropoda melanoleuca, Anolis carolinensis, Astya...
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53185"]]' 

            title                                      
  AH53185 | Ensembl 87 EnsDb for Anolis Carolinensis   
  AH53186 | Ensembl 87 EnsDb for Ailuropoda Melanoleuca
  ...       ...                                        
  AH75116 | Ensembl 98 EnsDb for Xenopus tropicalis    
  AH75117 | Ensembl 98 EnsDb for Zonotrichia albicollis

## To get the EnsDb for Mus musculus and Ensembl release 98:
> query(ah, "EnsDb.Mmusculus.v98")
AnnotationHub with 1 record
# snapshotDate(): 2019-05-02 
# names(): AH75036
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# $rdatadateadded: 2019-05-02
# $title: Ensembl 98 EnsDb for Mus musculus
# $description: Gene and protein annotations for Mus musculus based on Ensem...
# $taxonomyid: 10090
# $genome: GRCm38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("98", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
#   "Protein", "Transcript") 
# retrieve record with 'object[["AH75036"]]' 
> edb <- ah[["AH75036"]]
Ci-TJ commented 4 years ago

Thank you very much from my deep heart. your suggetions are very helpful for me. I just fixed the problem of AnnotationHub, and I actualized your suggesting commands on my computer. Best.