Closed: stolarczyk closed this issue 6 years ago
Thanks for reporting this; however, the code example works for me:
> getBM(attributes = c("affy_hg_u95av2", "hgnc_symbol", "chromosome_name", "band"),
+       filters = "affy_hg_u95av2",
+       values = c("1939_at", "1503_at", "1454_at"),
+       mart = mart)
  affy_hg_u95av2 hgnc_symbol chromosome_name   band
1        1503_at       BRCA2              13  q13.1
2        1454_at       SMAD3              15 q22.33
3        1939_at        TP53              17  p13.1
I've seen the "timeout was reached" error before and thought I'd resolved it, but I guess it depends on the size of your query and the specifics of your network configuration. I've raised the timeout from 1 minute to 5 minutes, so you can give that a try. It's available in version 2.35.15:
BiocInstaller::biocLite('grimbough/biomaRt')
You can also try querying a more local Ensembl mirror to see if that improves latency, e.g.
mart <- useMart(biomart = "ensembl",
dataset = "hsapiens_gene_ensembl",
host = "useast.ensembl.org")
Thank you for your response.
After updating the package to the version you recommended (and all its dependencies), the example worked. However, I'm still getting the curl error when executing my own query:
mart_name <- useEnsembl(biomart = "ensembl",dataset = "ggallus_gene_ensembl",mirror = "useast",verbose = T)
Attempting web service request:
http://useast.ensembl.org:80/biomart/martservice?type=version&requestid=biomaRt&mart=ENSEMBL_MART_ENSEMBL
V1
1 0.7
BioMartServer running BioMart version: 0.7
Mart virtual schema: default
Mart host: http://useast.ensembl.org:80/biomart/martservice
Checking attributes ...Attempting web service request:
http://useast.ensembl.org:80/biomart/martservice?type=attributes&dataset=ggallus_gene_ensembl&requestid=biomaRt&mart=ENSEMBL_MART_ENSEMBL&virtualSchema=default
ok
Checking filters ...Attempting web service request:
http://useast.ensembl.org:80/biomart/martservice?type=filters&dataset=ggallus_gene_ensembl&requestid=biomaRt&mart=ENSEMBL_MART_ENSEMBL&virtualSchema=default
ok
> getBM(
+   attributes = c("ensembl_transcript_id", "cdna_coding_start", "cdna_coding_end", "cdna"),
+   mart = mart_name
+ )
Error in curl::curl_fetch_memory(url, handle = handle) :
transfer closed with outstanding read data remaining
How can I determine whether the problem is on my end?
I think there may be something malformed deeper in this dataset. The code below takes two steps: first it fetches all the gene IDs, then it retrieves the transcript information in a second query.
This seems a round-the-houses approach, but the advantage is that when you provide a list of values, biomaRt will automatically break it into chunks of 500 and submit a separate query for each chunk. That should be a little faster, gives a nice progress bar for long queries, and reduces the chance of a curl timeout.
genes <- getBM(attributes = "ensembl_gene_id",
mart = mart_name)
transcripts <- getBM(values = genes$ensembl_gene_id,
filters = "ensembl_gene_id",
attributes = c("ensembl_transcript_id",
"cdna_coding_start",
"cdna_coding_end",
"cdna"),
mart = mart_name)
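For reference, the batching biomaRt applies internally is equivalent to something like this hand-rolled version (a sketch; `chunk_ids` is my own helper name, not part of biomaRt):

```r
## Split a vector of IDs into chunks of 500, mirroring biomaRt's
## internal batch size, so each chunk can be queried separately.
chunk_ids <- function(ids, chunk_size = 500) {
  split(ids, ceiling(seq_along(ids) / chunk_size))
}

## Usage sketch (needs network access, so shown as comments):
## results <- lapply(chunk_ids(genes$ensembl_gene_id), function(ids)
##   getBM(values = ids, filters = "ensembl_gene_id",
##         attributes = c("ensembl_transcript_id", "cdna_coding_start",
##                        "cdna_coding_end", "cdna"),
##         mart = mart_name))
## transcripts <- do.call(rbind, results)
```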
However, I consistently get the following error or something similar:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 165 did not have 4 elements
This suggests something is being returned that doesn't have the 4 fields you've requested, but I don't know what. I think there's a fair chance that's causing the problem in your initial query too. I'll take a look and see if I can understand what's causing it.
I've just realised you're trying to get sequence data with this query. Obtaining sequences from BioMart can be a bit awkward: it returns them in FASTA format rather than the regular tabular structure everything else is retrieved in, and biomaRt then has to parse that. There's a function, getSequence(), which tries to handle this, but it's not very flexible about what extra data can be included beyond the ID and the sequence itself.
I don't know what your ultimate aim is, but maybe you can do something like the following, which returns one table with transcript IDs and cDNA coding start/end values, and a second table with the transcript IDs and cDNA sequences. You can then use the ID column to match entries and get to your end point.
mart <- useEnsembl(biomart = "ensembl",dataset = "ggallus_gene_ensembl")
## vector of all gene IDs (only used to improve processing later)
genes <- getBM(attributes = "ensembl_gene_id",
mart = mart)
## first table with cDNA start/end values
transcripts <- getBM(values = genes$ensembl_gene_id,
filters = "ensembl_gene_id",
attributes = c("ensembl_transcript_id",
"cdna_coding_start",
"cdna_coding_end"),
mart = mart)
## second table with sequences
transcripts2 <- getBM(values = genes$ensembl_gene_id,
filters = "ensembl_gene_id",
attributes = c("ensembl_transcript_id",
"cdna"),
mart = mart)
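Matching the two tables on the shared transcript ID column is then a plain join, e.g. with base-R merge() (shown with toy stand-in data frames, since the real ones come from the network queries above):

```r
## Toy stand-ins for the two getBM() results above; the real tables
## would contain Ensembl transcript accessions. The point is the join
## on the shared ensembl_transcript_id column.
transcripts  <- data.frame(ensembl_transcript_id = c("T1", "T2"),
                           cdna_coding_start = c(10, 25),
                           cdna_coding_end   = c(120, 300))
transcripts2 <- data.frame(ensembl_transcript_id = c("T2", "T1"),
                           cdna = c("ATGGC", "GGCAT"))

## base-R merge keyed on the transcript ID; dplyr::inner_join() with
## by = "ensembl_transcript_id" would do the same.
combined <- merge(transcripts, transcripts2, by = "ensembl_transcript_id")
```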
Great, that looks like a properly working approach. I'll get back to you once I've tested it in multiple scenarios within my pipeline.
Thanks!
I've tested the suggested approach within my pipeline for multiple organisms and attribute sets, and it generally worked until today; now I'm randomly getting an error. For example:
> mart <- useEnsembl(biomart = "ensembl",dataset ="mmulatta_gene_ensembl",host = "useast.ensembl.org")
> # vector of all gene IDs (only used to improve processing later)
> genes <- getBM(attributes = "ensembl_gene_id", mart = mart)
> # first table with cDNA start/end values
> transcripts <- getBM(values = genes$ensembl_gene_id, filters = "ensembl_gene_id", attributes = c("ensembl_transcript_id","ensembl_peptide_id","cdna_coding_start","cdna_coding_end"), mart = mart)
> # process the dataframe: collapse entries with the same transcript ID
> transcripts_mod <- transcripts %>%
+   group_by(ensembl_transcript_id) %>%
+   summarise(ensembl_peptide_id = unique(ensembl_peptide_id)[1],
+             START = paste(cdna_coding_start, collapse = ","),
+             STOP  = paste(cdna_coding_end, collapse = ","))
> rm(transcripts)
> # second table with sequences
> transcripts2 <- getBM(values = genes$ensembl_gene_id, filters = "ensembl_gene_id", attributes = c("ensembl_transcript_id","cdna"),mart = mart)
Batch submitting query [=============-----------------------------------------------------------------------------------------] 12% eta: 9m
Error in curl::curl_fetch_memory(url, handle = handle) :
  transfer closed with outstanding read data remaining
The error isn't tied to a specific organism or set of attributes; it just happens randomly. And as mentioned above, today it happens almost every time.
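In the meantime I've been wrapping the calls in a small retry helper, which at least papers over the intermittent failures (`with_retries` is my own sketch, not part of biomaRt):

```r
## Retry a query a few times before giving up, pausing between attempts.
with_retries <- function(query, max_tries = 3, wait = 5) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(query(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", i, " failed: ", conditionMessage(result))
    if (i < max_tries) Sys.sleep(wait)
  }
  stop("Query failed after ", max_tries, " attempts")
}

## Usage sketch (network access required, so shown as comments):
## transcripts2 <- with_retries(function()
##   getBM(values = genes$ensembl_gene_id, filters = "ensembl_gene_id",
##         attributes = c("ensembl_transcript_id", "cdna"), mart = mart))
```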
Connection problems probably aren't related to the original issue. Please open another issue if you continue to have timeouts.
The getBM() function yielded some errors (Error in curl::curl_fetch_memory(url, handle = handle) : timeout was reached) when used for my purposes. After running the example code from the docs I got the following: