grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

Possible bug with getSequence #32

Closed gtollefson closed 3 years ago

gtollefson commented 3 years ago

Hi,

I've built a program that depends on biomaRt::getSequence(), and it appears to return different sequences when the same query is run repeatedly for one gene. I am retrieving the exon and intron sequence of the human STAT1 gene, using the HGNC symbol as the query input against the Ensembl database. I've copied my code below for reproducibility. I've also pasted a screenshot of the output I receive, showing that the resulting sequence has different lengths when the query is run multiple times in succession. The two sequence lengths in my pasted example are close to one another, but there are instances when it returns only 76kb instead of the ~112kb/~111kb returned in my pasted example.

This does not occur for every gene I run, but it does also occur with STAT2. I wonder whether sequences for multiple isoforms are stored in Ensembl and are returned at random when no isoform is specified? Can you help me get consistent results with the code I've provided below?

library(biomaRt)

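# Connect to the Ensembl BioMart and select the human gene dataset
mart <- biomaRt::useDataset(dataset = "hsapiens_gene_ensembl",
                            mart    = useMart("ENSEMBL_MART_ENSEMBL",
                                              host = "www.ensembl.org"))
gene <- "STAT1"

# Report the length of the returned exon + intron sequence for the gene
nchar(biomaRt::getSequence(id = gene,
                           type = "hgnc_symbol",
                           seqType = "gene_exon_intron",
                           mart = mart))

grimbough commented 3 years ago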

Duplicate of #33.