grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

curl timeout for simple (but large) queries #20

Closed mschubert closed 4 years ago

mschubert commented 4 years ago

This was supposedly fixed in #3, but I continue to have some timeout issues. Would it be possible to make timeouts configurable? (e.g. via options)

I'm trying to get all human exon coordinates, and it times out every time:

ensembl = biomaRt::useEnsembl("ensembl", dataset="hsapiens_gene_ensembl", GRCh=37,
                              host = "useast.ensembl.org")
res = biomaRt::getBM(attributes=c("ensembl_gene_id", "ensembl_exon_id",
                                  "exon_chrom_start", "exon_chrom_end",
                                  "chromosome_name", "gene_exon"),
                     mart=ensembl)

Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: Operation timed out after 300001 milliseconds with 4054861 bytes received

Note that the number of bytes received changes between attempts, and US east is the closest mirror to me (the query also fails against the main Ensembl instance).

Using biomaRt_2.36.1, curl_4.0

grimbough commented 4 years ago

I don't want to change the timeout argument as 5 minutes is the limit imposed by the Ensembl web interface, and part of the cause of the recent slowdowns is an increasing number of long running queries from other sources. biomaRt isn't really designed for bulk downloading of data as you're doing here.

I have a few suggestions though:

I suspect that asking for the exonic sequences is the real bottleneck here, so if you don't need them, drop the "gene_exon" attribute and the query will run much faster.
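For example, the original query with only the coordinate attributes (a sketch reusing the `ensembl` mart object from the report above) would be:

```r
## Same query as in the report, minus the slow "gene_exon" sequence attribute
res <- biomaRt::getBM(attributes = c("ensembl_gene_id", "ensembl_exon_id",
                                     "exon_chrom_start", "exon_chrom_end",
                                     "chromosome_name"),
                      mart = ensembl)
```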

Alternatively, you can break your query down into sub-queries over smaller blocks of genes. Each of these sub-queries gets its own 5-minute timeout, so you have a much better chance of this working. biomaRt actually does this automatically if you provide the filters and values arguments, but can't do it when you're bulk downloading everything. You can emulate the behaviour by downloading all ensembl_gene_ids and then passing these to a second query e.g.

all_gene_ids <- biomaRt::getBM(attributes = c("ensembl_gene_id"),
                               mart = ensembl)
res2 <- biomaRt::getBM(attributes = c("ensembl_gene_id", "ensembl_exon_id",
                                      "exon_chrom_start", "exon_chrom_end",
                                      "chromosome_name", "gene_exon"),
                       filters = "ensembl_gene_id",
                       values = all_gene_ids$ensembl_gene_id,
                       mart = ensembl)
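If the automatic batching still times out, the same idea can be pushed further by chunking the gene IDs yourself and querying one block at a time. This is a sketch using the `all_gene_ids` and `ensembl` objects above; the block size of 5000 is an arbitrary choice, not anything biomaRt prescribes:

```r
## Split the gene IDs into blocks so each request gets its own 5-minute window
gene_ids <- all_gene_ids$ensembl_gene_id
blocks <- split(gene_ids, ceiling(seq_along(gene_ids) / 5000))

res_list <- lapply(blocks, function(ids) {
    biomaRt::getBM(attributes = c("ensembl_gene_id", "ensembl_exon_id",
                                  "exon_chrom_start", "exon_chrom_end",
                                  "chromosome_name", "gene_exon"),
                   filters = "ensembl_gene_id",
                   values = ids,
                   mart = ensembl)
})

## Stitch the per-block results back into one data.frame
res2 <- do.call(rbind, res_list)
```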

You could also use a combination of the ensembldb and BSgenome packages to get this information without relying on a connection to BioMart at all. There's a fairly large initial download, but after that it should be pretty quick e.g.

library(ensembldb)
library(AnnotationHub)
## Load the annotation resource & select the Ensembl 97 human data
ah <- AnnotationHub()
ahDb <- query(ah, pattern = c("Ensembl 97 EnsDb for Homo sapiens"))
ahEdb <- ahDb[[1]]

## Get the exon information for the main chromosomes
ex <- exons(filter(ahEdb, filter = SeqNameFilter(c(1:22,"X","Y","MT"))))

library(BSgenome.Hsapiens.NCBI.GRCh38)
bsg <- BSgenome.Hsapiens.NCBI.GRCh38
## we need to flag the mitochondrial genome as circular
isCircular(ex)[seqlevels(ex) == "MT"] <- TRUE
## extract the exonic sequences
exon_seqs <- getSeq(bsg, names = ex)

Here the ex and exon_seqs objects contain the names, coordinates and sequences of the exons, which you can then combine.
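As a rough sketch, the two objects could be joined into a single table like this (assuming the `ex` GRanges and `exon_seqs` DNAStringSet from the code above; `exons()` returns the exon IDs in the `exon_id` metadata column):

```r
## Combine exon coordinates and sequences into one data.frame
exon_tab <- data.frame(
    exon_id    = ex$exon_id,
    chromosome = as.character(seqnames(ex)),
    start      = start(ex),
    end        = end(ex),
    sequence   = as.character(exon_seqs)
)
```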

mschubert commented 4 years ago

The sequences were indeed the culprit here. I wasn't sure what gene_exon meant, but I had found it in another query so I left it in.

Query works fine without it.

However, for certain kinds of tasks a user may still want to set a custom timeout (longer or shorter), so I wonder whether the request still has merit.
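To make the request concrete, here is a purely hypothetical sketch of what an options-based timeout could look like. Note that `biomaRt.timeout` is not an existing biomaRt option; how (or whether) the package would consume such a value is entirely up to the maintainer:

```r
## Hypothetical user-facing knob -- NOT a real biomaRt option, illustration only
options(biomaRt.timeout = 600)

## Internally, the package could then respect it when building requests,
## e.g. with httr's per-request timeout:
timeout_secs <- getOption("biomaRt.timeout", default = 300)
# response <- httr::GET(url, httr::timeout(timeout_secs))
```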