Closed dvantwisk closed 7 years ago
There seems to be an issue just downloading that table through the UCSC table browser web interface, so I guess this is a UCSC issue. It pauses after downloading about 1.44 MB, like the size of an old floppy. After a minute or so, it resumes. I guess RCurl is not as robust to that as my web browser.
Hi Michael,
This problem came up when Daniel was building the TxDbs for the upcoming release. To get around this for rheMac8 we can manually download the table and use makeTxDbPackage.
In the bigger picture, do you think this should be reported to UCSC? If yes, what would be a clean example to send them? Just a browser query of the refGene table for rheMac8?
Val
You could just tell them about the issues with downloading RefSeq genes for rheMac8 via the table browser. I don't think there's a clean URL or anything for communicating that.
OK thanks. @dvantwisk can you report this to UCSC?
Sure, I'll tell them about this.
@lawremi It's not clear how to report this problem to UCSC. rtracklayer uses RCurl and passes a cookie. When we just curl the url and don't pass a cookie it works. We want to report the problem in a Bioconductor-independent way so that's probably just an example of a query through the UCSC browser web interface.
Daniel mentioned there are several interfaces for downloading these files from the UCSC browser - sorry I'm not more familiar with this. It would be helpful if you could send us the link to the browser interface where you observed the lag in download.
This is the table browser I was referring to:
Thanks Michael. @dvantwisk can you reproduce the delay via this link? If yes, it should be enough for a bug report.
@lawremi Note that getTable() uses the following call to RCurl::getForm() internally to retrieve the table data:
library(RCurl)
url <- "http://genome.ucsc.edu/cgi-bin/hgTables"
.form <- list(
    hgta_group="allTracks",
    hgta_track=c(`RefSeq Genes`="refGene"),
    hgta_regionType="genome",
    hgta_table="refGene",
    hgta_outputType="primaryTable",
    hgta_compressType="none",
    hgta_doTopSubmit="get output",
    hgsid.value="610687863_GACcmloVspXbhMXNWq3oOCmyBznN"
)
.opts <- list(
    cookie="hguid=593208669_Ck1llY7d937McU6ABT0aKtvdExOS",
    useragent="rtracklayer"
)
out <- getForm(url, .params=.form, .opts=.opts)
As noted by Val and Daniel, what's getting in the way here is the cookie. If you just use:
.opts <- list(
    useragent="rtracklayer"
)
instead then the call to getForm() works.
Is there any particular reason to use a cookie for downloading the table data?
Thanks,
H.
Of course this should not prevent us from reporting the problem to the UCSC folks (if we manage to reproduce it with their tools). Was just wondering how hard it would be to make getTable() more robust to these kinds of problems.
H.
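One generic way to make a download more tolerant of transient failures is to retry it a few times. The sketch below is only an illustration in base R, not how rtracklayer or getTable() handles errors; with_retries() and its arguments are hypothetical names, and retrying may not help if the server fails the same way every time.

```r
# Hypothetical helper (not part of rtracklayer or RCurl): call `fetch()` up to
# `max_tries` times, sleeping `wait` seconds between failed attempts.
with_retries <- function(fetch, max_tries = 3, wait = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    # Capture any error as a condition object instead of aborting.
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    if (attempt < max_tries) Sys.sleep(wait)
  }
  stop("all ", max_tries, " attempts failed; last error: ",
       conditionMessage(last_error))
}

# Assumed usage with the call shown earlier in this thread (needs network):
# out <- with_retries(function() getForm(url, .params=.form, .opts=.opts))
```

The request is passed as a function so that each attempt issues a fresh call.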
I don't know if the hguid is necessary or not. Perhaps it's only needed when working with custom (user uploaded) tracks. Are you sure that you're getting the right data (i.e., not human data) after removing the cookie? If so, then I guess there could be a way to disable the sending of the cookie in ucscGet(). Super weird that things break though, because the browser is obviously sending the cookie.
I think you can control the assembly by adding something like db="rheMac8" to .form. Also hgsid.value and hguid don't seem to be needed. With a simpler example (Assembly track for sacCer3):
library(RCurl)
url <- "http://genome.ucsc.edu/cgi-bin/hgTables"
.form <- list(
    db="sacCer3",
    hgta_group="map",
    hgta_track="gold",
    hgta_regionType="genome",
    hgta_table="gold",
    hgta_outputType="primaryTable",
    hgta_compressType="none",
    hgta_doTopSubmit="get output"
)
.opts <- list(
    useragent="rtracklayer"
)
out <- getForm(url, .params=.form, .opts=.opts)
Then:
> cat(out)
#bin chrom chromStart chromEnd ix type frag fragStart fragEnd strand
73 chrI 0 230218 1 O BK006935.2 0 230218 +
585 chrM 0 85779 1 F NC_001224 0 85779 +
73 chrV 0 576874 1 O BK006939.2 0 576874 +
73 chrX 0 745751 1 O BK006943.2 0 745751 +
73 chrII 0 813184 1 O BK006936.2 0 813184 +
9 chrIV 0 1531933 1 O BK006938.2 0 1531933 +
73 chrIX 0 439888 1 O BK006942.2 0 439888 +
73 chrVI 0 270161 1 O BK006940.2 0 270161 +
73 chrXI 0 666816 1 O BK006944.2 0 666816 +
9 chrXV 0 1091291 1 O BK006948.2 0 1091291 +
73 chrIII 0 316620 1 O BK006937.2 0 316620 +
9 chrVII 0 1090940 1 O BK006941.2 0 1090940 +
9 chrXII 0 1078177 1 O BK006945.2 0 1078177 +
73 chrXIV 0 784333 1 O BK006947.3 0 784333 +
73 chrXVI 0 948066 1 O BK006949.2 0 948066 +
73 chrVIII 0 562643 1 O BK006934.2 0 562643 +
73 chrXIII 0 924431 1 O BK006946.2 0 924431 +
H.
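To check that a cookie-less request really returns data for the intended assembly, the text can be parsed and the chromosome names inspected. This is a rough sketch assuming the table browser output is tab-separated with a header line starting with "#"; the two sample rows are copied from the sacCer3 output above, and the name `gold` is only illustrative.

```r
# Two rows copied from the sacCer3 output above, rebuilt as tab-separated text.
# In practice this string would be the `out` value returned by getForm().
out <- paste(
  "#bin\tchrom\tchromStart\tchromEnd\tix\ttype\tfrag\tfragStart\tfragEnd\tstrand",
  "73\tchrI\t0\t230218\t1\tO\tBK006935.2\t0\t230218\t+",
  "585\tchrM\t0\t85779\t1\tF\tNC_001224\t0\t85779\t+",
  sep = "\n"
)

# comment.char = "" keeps the header line even though it starts with "#";
# check.names = FALSE stops R from mangling "#bin" into "X.bin".
gold <- read.delim(text = out, comment.char = "", check.names = FALSE,
                   stringsAsFactors = FALSE)
names(gold)[1] <- sub("^#", "", names(gold)[1])

# sacCer3 chromosomes are chrM or Roman numerals (chrI, chrII, ...), so
# human-style names such as chr1 would fail this check.
stopifnot(all(grepl("^chr([IVX]+|M)$", gold$chrom)))
```

For an assembly whose chromosome names overlap with human ones, a name check like this would not be conclusive, and comparing against the table from a browser download would be a safer test.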
Yea but I am relying on UCSC for maintaining that state (the genome) via the hgsid. The hgsid is used throughout the UCSC website, so it's simplest to just refer to that. The hguid might only be for custom tracks, so maybe we could drop it from that context.
Was a FWIW only. Always a good exercise to try to identify the "minimum requirements".
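For reference, the minimal set of form fields from the sacCer3 example can be wrapped in a small function. This is a sketch only: minimal_table_form() is a hypothetical helper, not an rtracklayer function, and whether db alone selects the right assembly for every genome is the assumption being discussed in this thread.

```r
# Hypothetical helper: assemble the hgTables form fields for a whole-genome
# dump of one table, naming the assembly explicitly via `db` instead of
# relying on the hgsid session or the hguid cookie.
minimal_table_form <- function(db, group, track, table = track) {
  list(
    db = db,
    hgta_group = group,
    hgta_track = track,
    hgta_regionType = "genome",
    hgta_table = table,
    hgta_outputType = "primaryTable",
    hgta_compressType = "none",
    hgta_doTopSubmit = "get output"
  )
}

# Field values taken from the refGene request shown earlier in this thread.
form <- minimal_table_form("rheMac8", group = "allTracks", track = "refGene")

# Assumed usage (needs network and RCurl):
# out <- RCurl::getForm("http://genome.ucsc.edu/cgi-bin/hgTables",
#                       .params = form, .opts = list(useragent = "rtracklayer"))
```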
I changed it to drop the cookie when interacting with the table browser. Seems to work for me now. Sorry it took so long.
We've encountered an error while trying to build TxDb packages from GenomicFeatures and have isolated it to rtracklayer. As the title states, we are unable to query the rheMac8 genome and we receive a Gateway Timeout Error. This code seems to be failing uniquely on this genome and functions correctly for all others attempted (e.g. rheMac3 as the genome works). The below code is a reproducible example of the error. Do you have any insight on this issue?