grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/
34 stars 13 forks source link

Possible bug in getSequence() #33

Open gtollefson opened 3 years ago

gtollefson commented 3 years ago

Hi,

I've built a program which depends upon biomaRt::getSequence() and it appears to return several different sequences when the same query is run for one gene. I am retrieving the exon and intron gene sequence of the human STAT1 gene using HGNC symbol as a query input against the Ensembl database. I've copied my code for reproducibility below. I've also pasted a screenshot of the output I recieve showing that the resulting sequence is of different lengths when run multiple times in succession. The two different sequence lengths output in my pasted example are close to one another in length, but there are instances when it returns only 76kb instead of the ~112kb/~111kb returned in my pasted example.

This does not occur for all genes that I run. But it does also occur with STAT2. I wonder if there are sequences for multiple isoforms saved in Ensembl and they are returned at random without isoform specification? Can you help me to get consistent results with the code I've provided below?

library(biomaRt)

mart <- biomaRt::useDataset(dataset = "hsapiens_gene_ensembl",         
                            mart    = useMart("ENSEMBL_MART_ENSEMBL",       
                                              host    = "www.ensembl.org"))   
gene="STAT1"

nchar(biomaRt::getSequence(id = gene, 
                           type = "hgnc_symbol", 
                           seqType = "gene_exon_intron", 
                           mart = mart))
Screen Shot 2020-11-02 at 12 16 29 PM
grimbough commented 3 years ago

Thanks for the report, and I can confirm that I see the same behaviour. I think this is probably because Ensembl BioMart is orientated around Ensembl IDs and it isn't supposed to allow getting back a sequence and the HGNC symbol in a single query. If you're using the web interface the option simply isn't available.

image

We can also check this programmatically in biomaRt by searching for the available attributes. The two attributes getSequence() is requesting are on different "pages" and you shouldn't be able to mix attributes across pages in a single query.

searchAttributes(mart, pattern = "hgnc")
##                name        description         page
##  63         hgnc_id            HGNC ID feature_page
##  64     hgnc_symbol        HGNC symbol feature_page
##  94 hgnc_trans_name Transcript name ID feature_page

searchAttributes(mart, pattern = "gene_exon_intron")
##                   name      description      page
##  3002 gene_exon_intron Unspliced (Gene) sequences

Nonetheless it clearly works sometimes, and biomaRt doesn't stop it happening, but I suspect this is cause of the inconsistencies.

My suggestion would be to convert your HGNC IDs to Ensembl Gene IDs and then request the sequences. There are lots of ways to do that mapping, and it will never be perfect 1-to-1, but with biomaRt something like this should work:

id_map <- getBM(attributes = c("hgnc_symbol", "ensembl_gene_id"), 
          filters = "hgnc_symbol", 
          values = "STAT1", 
          mart = mart)
id_map
##    hgnc_symbol ensembl_gene_id
##  1       STAT1 ENSG00000115415

Then use the Ensembl ID for querying:

getSequence(id = id_map[,"ensembl_gene_id"], 
            type = "ensembl_gene_id", 
            seqType = "gene_exon_intron", 
            mart = mart)

If I run that 50 times, I get a consistent length:

tmp <- lapply(rep("ENSG00000115415", 50), FUN = function(x) {
    ## this line makes sure I'm not retrieving a cache result already stored on the local disk
    biomartCacheClear()
    nchar(biomaRt::getSequence(id = x, 
                               type = "ensembl_gene_id", 
                               seqType = "gene_exon_intron", 
                               mart = mart))
})
table(do.call("rbind", tmp)[,1])
##  112501 
##      50 

I see the hgnc_symbol example is in the biomaRt documentation. Since there are some clear problems with this approach I will update the documentation and the code to warn about this potentially happening in the future. It seems likely that many users would not be as observant as you and miss potentially incorrect results coming back.

gtollefson commented 3 years ago

@grimbough Ah ha! I see, thank you for explaining.

However when I run the solution code you provide I get the following error:

Error in is.single.string(object) : argument "object" is missing, with no default

When I run it with the same ensembl id by assigning it to a variable, I get the same error.

STAT1_id = "ENSG00000115415"

getSequence(id =STAT1_id, 
            type = "ensembl_gene_id", 
            seqType = "gene_exon_intron", 
            mart = mart)

Error in is.single.string(object) : argument "object" is missing, with no default

I am using the following biomaRt versions and R versions:

Screen Shot 2020-11-05 at 2 10 15 PM Screen Shot 2020-11-05 at 2 10 04 PM

Do you have any suggestions to help me run getSequence usingensembl_gene_id as a type?

gtollefson commented 3 years ago

@grimbough My mistake, the seqinr library masked the getSequence function and was the source of the second issue. I solved it using biomaRt::getSequence instead of getSequence.

Thank you very much for all of your help!

grimbough commented 3 years ago

I've just taken another look at this, and my previous comments are only partly correct. Via the web interface you should be able to filter using HGNC ID, but are not able to include that in the header that is returned via the attributes. (I have no idea why that's the case, the must be an internal Ensembl BioMart reason).

The getSequence() function is too simplistic, and assumes anything you can filter on can also be returned. I'm going to amend the code to reflect the real status.