grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

getBM error in documentation example #3

Closed stolarczyk closed 6 years ago

stolarczyk commented 6 years ago

The getBM() function was giving me errors (Error in curl::curl_fetch_memory(url, handle = handle) : timeout was reached) when I used it for my own queries. After running the example code from the docs I got the following:

> mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
> getBM(attributes = c("affy_hg_u95av2", "hgnc_symbol", "chromosome_name", "band"),
+       filters    = "affy_hg_u95av2",
+       values     = c("1939_at","1503_at","1454_at"), 
+       mart       = mart)
No encoding supplied: defaulting to UTF-8.
Error in getBM(attributes = c("affy_hg_u95av2", "hgnc_symbol", "chromosome_name",  : 
  The query to the BioMart webservice returned an invalid result: biomaRt expected a character string of length 1. Please report this to the mailing list.
> 
> packageVersion('biomaRt')
[1] ‘2.35.12’
grimbough commented 6 years ago

Thanks for reporting this; however, the code example works for me:

> getBM(attributes = c("affy_hg_u95av2", "hgnc_symbol", "chromosome_name", "band"),
+              filters    = "affy_hg_u95av2",
+              values     = c("1939_at","1503_at","1454_at"), 
+              mart       = mart)
  affy_hg_u95av2 hgnc_symbol chromosome_name   band
1        1503_at       BRCA2              13  q13.1
2        1454_at       SMAD3              15 q22.33
3        1939_at        TP53              17  p13.1

I've seen the "timeout was reached" error before and thought I'd resolved it, but I guess it depends on the size of your query and the specifics of your network configuration. I've raised the timeout from 1 minute to 5 minutes, so you can give that a try. It's available in version 2.35.15:

BiocInstaller::biocLite('grimbough/biomaRt')
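
After installing and restarting R, packageVersion() (as you used above) should confirm that the newer version is the one being loaded:

packageVersion('biomaRt')
## should now report 2.35.15 or later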

You can also try querying a more local Ensembl mirror to see if that improves latency, e.g.

mart <- useMart(biomart = "ensembl", 
                dataset = "hsapiens_gene_ensembl", 
                host = "useast.ensembl.org")
stolarczyk commented 6 years ago

Thank you for your response.

After updating the package to the version you recommended (along with all its dependencies), the example worked. However, I'm still getting the curl error when executing my own query:

mart_name <- useEnsembl(biomart = "ensembl",dataset = "ggallus_gene_ensembl",mirror = "useast",verbose = T)

Attempting web service request:
http://useast.ensembl.org:80/biomart/martservice?type=version&requestid=biomaRt&mart=ENSEMBL_MART_ENSEMBL
   V1
1 0.7
BioMartServer running BioMart version: 0.7
Mart virtual schema: default
Mart host: http://useast.ensembl.org:80/biomart/martservice
Checking attributes ...Attempting web service request:
http://useast.ensembl.org:80/biomart/martservice?type=attributes&dataset=ggallus_gene_ensembl&requestid=biomaRt&mart=ENSEMBL_MART_ENSEMBL&virtualSchema=default
 ok
Checking filters ...Attempting web service request:
http://useast.ensembl.org:80/biomart/martservice?type=filters&dataset=ggallus_gene_ensembl&requestid=biomaRt&mart=ENSEMBL_MART_ENSEMBL&virtualSchema=default
 ok

> getBM(
+   attributes = c("ensembl_transcript_id","cdna_coding_start","cdna_coding_end","cdna"),
+   mart = mart_name
+ )
Error in curl::curl_fetch_memory(url, handle = handle) : 
  transfer closed with outstanding read data remaining

How can I determine whether the problem is on my end?

grimbough commented 6 years ago

I think there may be something malformed deeper in this dataset. The code below works in two steps: first it gets all the gene IDs, then it gets the transcript information in a second query.

This seems like a round-the-houses approach, but the advantage is that when you provide a list of values, biomaRt will automatically break it into chunks of 500 and submit a separate query for each chunk. That should be a little faster, gives a nice progress bar for long queries, and there's less chance of a curl timeout.

genes <- getBM(attributes = "ensembl_gene_id",
               mart = mart_name)

transcripts <- getBM(values = genes$ensembl_gene_id,
                     filters = "ensembl_gene_id",
                     attributes = c("ensembl_transcript_id",
                                    "cdna_coding_start",
                                    "cdna_coding_end",
                                    "cdna"),
                     mart = mart_name)

However, I consistently get the following error or something similar:

 Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 165 did not have 4 elements 

This suggests something is being returned that doesn't have the 4 fields you've requested, but I don't know what. I think there's a fair chance that's causing the problem in your initial query too. I'll take a look and see if I can understand what's causing it.
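
One way to narrow it down (just a diagnostic sketch, nothing built into biomaRt) is to split the gene IDs into chunks yourself and run each chunk inside tryCatch(), recording which chunks fail to parse:

## split the gene IDs into chunks of 500, mirroring biomaRt's own batching
chunks <- split(genes$ensembl_gene_id,
                ceiling(seq_along(genes$ensembl_gene_id) / 500))

## query each chunk separately and note the ones that error out
failed <- integer(0)
for (i in seq_along(chunks)) {
  res <- tryCatch(
    getBM(values     = chunks[[i]],
          filters    = "ensembl_gene_id",
          attributes = c("ensembl_transcript_id",
                         "cdna_coding_start",
                         "cdna_coding_end",
                         "cdna"),
          mart       = mart_name),
    error = function(e) NULL)
  if (is.null(res)) failed <- c(failed, i)
}
failed  ## indices of the chunks whose results couldn't be parsed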

grimbough commented 6 years ago

I've just realised you're trying to get sequence data with this query. Obtaining sequences from BioMart can be a bit awkward, as it returns them in FASTA format rather than the regular tabular structure everything else is retrieved in, and biomaRt then has to parse that. There's a function, getSequence(), which tries to handle this, but it's not super flexible about what extra data is included other than the ID and the sequence itself.
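
For reference, a getSequence() call here (reusing mart_name and the genes vector from my previous comment; only a rough sketch) would look something like:

## fetch cDNA sequences for all genes via getSequence()
seqs <- getSequence(id      = genes$ensembl_gene_id,
                    type    = "ensembl_gene_id",
                    seqType = "cdna",
                    mart    = mart_name)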

I don't know what your ultimate aim is, but maybe you can do something like the following, which returns one table with transcript IDs and the CDNA start/end values, and a second table with the transcript IDs and the CDNA sequences. You can then use the ID column to match entries and get to your end point.

mart <- useEnsembl(biomart = "ensembl",dataset = "ggallus_gene_ensembl")

## vector of all gene IDs (only used to improve processing later)
genes <- getBM(attributes = "ensembl_gene_id",
               mart = mart)

## first table with CDNA start/end values
transcripts <- getBM(values = genes$ensembl_gene_id,
                     filters = "ensembl_gene_id",
                     attributes = c("ensembl_transcript_id",
                                    "cdna_coding_start",
                                    "cdna_coding_end"),
                     mart = mart)

## second table with sequences
transcripts2 <- getBM(values = genes$ensembl_gene_id,
                      filters = "ensembl_gene_id",
                      attributes = c("ensembl_transcript_id",
                                     "cdna"),
                      mart = mart)
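
The two tables can then be joined on the shared transcript ID, e.g. with base R's merge():

## combine the start/end table and the sequence table on the transcript ID
cdna_table <- merge(transcripts, transcripts2, by = "ensembl_transcript_id")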
stolarczyk commented 6 years ago

Great, this looks like a working approach. I'll get back to you once I've tested it in multiple scenarios within my pipeline.

Thanks!

stolarczyk commented 6 years ago

I've tested the suggested approach within my pipeline for multiple organisms and attributes, and it generally worked until today. Now I'm randomly getting an error, in this case for example:

> mart <- useEnsembl(biomart = "ensembl",dataset ="mmulatta_gene_ensembl",host = "useast.ensembl.org")
> # vector of all gene IDs (only used to improve processing later)
> genes <- getBM(attributes = "ensembl_gene_id", mart = mart)
> # first table with CDNA start/end values
> transcripts <- getBM(values = genes$ensembl_gene_id, filters = "ensembl_gene_id", attributes = c("ensembl_transcript_id","ensembl_peptide_id","cdna_coding_start","cdna_coding_end"), mart = mart)
> # process the dataframe: collapse entries with the same transcript ID
> library(dplyr)
> transcripts_mod <- transcripts %>% group_by(ensembl_transcript_id) %>% summarise(ensembl_peptide_id = unique(ensembl_peptide_id)[1], START = paste(cdna_coding_start, collapse = ","), STOP = paste(cdna_coding_end, collapse = ","))
> rm(transcripts)
> # second table with sequences
> transcripts2 <- getBM(values = genes$ensembl_gene_id, filters = "ensembl_gene_id", attributes = c("ensembl_transcript_id","cdna"),mart = mart)
Batch submitting query [=============-----------------------------------------------------------------------------------------]  12% eta:  9m
Error in curl::curl_fetch_memory(url, handle = handle) : 
  transfer closed with outstanding read data remaining

The error isn't tied to a specific organism or set of attributes; it just happens randomly. And as mentioned above, today it happens almost every time.
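
As a stopgap, the failing query could be wrapped in a simple retry loop (just a workaround sketch on my side, not something biomaRt provides) so that a transient connection failure doesn't kill the whole run:

## retry a getBM() call a few times before giving up (workaround sketch)
get_with_retry <- function(..., attempts = 3) {
  for (i in seq_len(attempts)) {
    res <- tryCatch(getBM(...), error = function(e) NULL)
    if (!is.null(res)) return(res)
    Sys.sleep(30)  # pause before retrying a transient connection failure
  }
  stop("query failed after ", attempts, " attempts")
}

transcripts2 <- get_with_retry(values = genes$ensembl_gene_id,
                               filters = "ensembl_gene_id",
                               attributes = c("ensembl_transcript_id", "cdna"),
                               mart = mart)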

grimbough commented 6 years ago

Connection problems probably aren't related to the original issue. Please open another issue if you continue to have timeouts.