grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

The timeout=300 on the first line of .submitQueryXML is too short. #22

Closed abelew closed 3 years ago

abelew commented 4 years ago

Querying transcript lengths with getBM() against the Ensembl hsapiens mart fails with a timeout triggered on the first line of .submitQueryXML. Ideally, the hard-coded timeout=300 in the httr::POST call would be replaced with a parameter that the user can modify.
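
To make the request concrete, here is a minimal sketch of what a user-adjustable timeout could look like. It is not biomaRt's code: the option name demo.biomart.timeout, both helper functions, and the form field name "query" are hypothetical and exist only for illustration. The only real calls are httr::POST() and httr::timeout().

```r
# Hypothetical sketch: NOT biomaRt's actual code or a supported option.
# "demo.biomart.timeout" is an invented option name used for illustration.
query_timeout <- function(default = 300) {
  seconds <- getOption("demo.biomart.timeout", default)
  # Reject anything that is not a single positive number
  if (!is.numeric(seconds) || length(seconds) != 1L ||
      is.na(seconds) || seconds <= 0) {
    stop("timeout must be a single positive number of seconds")
  }
  seconds
}

# POST an XML query using the configurable timeout.
# Assumes the httr package is installed; the field name "query" is an assumption.
post_query <- function(url, xml) {
  httr::POST(url, body = list(query = xml), httr::timeout(query_timeout()))
}
```

With something like this in place, a caller could run options(demo.biomart.timeout = 900) before submitting a slow query, while everyone else keeps the 300-second default.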

grimbough commented 4 years ago

The 300-second timeout has been set to match the time limit Ensembl imposes on the BioMart web interface. The server doesn't tend to perform well if many long-running queries are submitted, so I don't really want to make it user-configurable.

Perhaps there's a way to reformulate your query to make it work within the time limit? In my experience, queries that exceed the time limit are either being submitted without a filter (essentially a data dump) or have a really large number of attributes (the query engine seems to scale non-linearly as you increase the number of attributes). In both instances there's often a way to work around it in a manageable amount of time.
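
For the many-attributes case, one possible reformulation (an assumption on my part, the comment above does not spell out a specific method) is to run several smaller queries that each include a shared key column, then merge the results. The helper name fetch_in_attribute_groups and its group_size argument are hypothetical. The fetch argument defaults to biomaRt::getBM, and anything passed through ... (such as mart, filters, values) is forwarded to it.

```r
# Sketch only: the grouping strategy and helper name are illustrative
# assumptions, not part of biomaRt.
fetch_in_attribute_groups <- function(attributes, key, group_size = 3,
                                      fetch = biomaRt::getBM, ...) {
  others <- setdiff(attributes, key)
  # Split the non-key attributes into groups of at most `group_size`
  groups <- split(others, ceiling(seq_along(others) / group_size))
  # Run one smaller query per group, always including the key column
  parts <- lapply(groups, function(g) fetch(attributes = c(key, g), ...))
  # Join the partial tables back together on the key
  Reduce(function(x, y) merge(x, y, by = key, all = TRUE), parts)
}
```

The merge is only safe when the key identifies rows one-to-one across the partial results. Attributes with a one-to-many relationship to the key will multiply rows, so it is worth checking row counts afterwards.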

If you post what you're trying to do either here or at the Bioconductor Support Site I'd be happy to try and help.

abelew commented 4 years ago

Greetings, that seems quite sensible. However, I am only asking for the human start positions by transcript ID, and I still trip the time limit with this relatively stripped-down request to useast.ensembl.org's hsapiens mart:

requests <- c("ensembl_transcript_id", "start_position")
starts <- biomaRt::getBM(attributes=requests, mart=ensembl)

I was initially asking for start, end, chromosome, and strand, but figured (as you suggested) that I was asking for too much and so decided to ask for just one column. It seems to me that this should not be too onerous. My current workaround is to query by gene ID rather than transcript ID, but that seems unsatisfactory.

grimbough commented 4 years ago

Your query has no filter, which means it's basically doing a data dump, and the server-side BioMart software doesn't really like that. Generally it works best if you have a filter with no more than 500 values (you can supply more than 500, as biomaRt will automatically split anything longer into chunks).
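
To illustrate what chunking means here, the sketch below splits a vector of filter values into pieces of at most 500. It only mimics the idea; it is not biomaRt's internal implementation, and chunk_values is a hypothetical helper. The IDs are made up for the demonstration.

```r
# Illustration only: mimics the idea of chunking, not biomaRt's internal code.
chunk_values <- function(values, chunk_size = 500) {
  stopifnot(chunk_size >= 1)
  # Chunk index is 1 for the first `chunk_size` values, 2 for the next, ...
  unname(split(values, ceiling(seq_along(values) / chunk_size)))
}

ids <- sprintf("ENSG%011d", 1:1234)  # made-up IDs, for demonstration only
chunks <- chunk_values(ids)
lengths(chunks)
#> [1] 500 500 234
```

Each chunk would then be sent as its own filtered query and the results row-bound together, which keeps every individual request small.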

Here's the trick I use to get this type of query to run: first get all the gene IDs, then use them as the filter values. It's a bit of an ugly workaround, but since we ask for all gene IDs it should return everything your original query would have if it could run.

library(biomaRt)
ensembl <- useEnsembl("ensembl", 
                      dataset="hsapiens_gene_ensembl") 

## get all gene ids 
## BioMart can cope with this query without any filters
all_gene_ids <-  getBM(attributes = 'ensembl_gene_id', 
                       mart = ensembl)

## We use all gene ids as values, so we don't miss any data
## but biomaRt will chunk the query automatically and run much faster
requests <- c("ensembl_transcript_id", "start_position")
starts <- getBM(attributes = requests, 
                filters = "ensembl_gene_id", 
                values = all_gene_ids,
                mart = ensembl)

head(starts)
#>   ensembl_transcript_id start_position
#> 1       ENST00000448773       32628032
#> 2       ENST00000317907       32628032
#> 3       ENST00000647819       32628032
#> 4       ENST00000454690       32628032
#> 5       ENST00000438654       32628032
#> 6       ENST00000433416       32628032

abelew commented 4 years ago

Haha that is nasty, I like it.

On Sat, Aug 15, 2020, 9:53 AM Mike Smith notifications@github.com wrote:

