lawremi / rtracklayer

R interface to genome annotation files and the UCSC genome browser

Gateway Timeout Error while querying rheMac8 genome from UCSC #2

Closed dvantwisk closed 7 years ago

dvantwisk commented 7 years ago

We've encountered an error while trying to build TxDb packages from GenomicFeatures and have isolated it to rtracklayer. As the title states, we are unable to query the rheMac8 genome; the query fails with a Gateway Timeout error. The failure appears specific to this genome: every other genome we tried (e.g. rheMac3) works. The code below is a reproducible example of the error. Do you have any insight on this issue?

library(rtracklayer)
session <- browserSession()
genome(session) <- "rheMac8"
query <- ucscTableQuery(session, "RefSeq Genes", table="refGene")
ucsc_txtable <- getTable(query)
## Error: Gateway Time-out

lawremi commented 7 years ago

There seems to be an issue just downloading that table through the UCSC table browser web interface, so I guess this is a UCSC issue. The download pauses after about 1.44 MB, like the size of an old floppy. After a minute or so, it resumes. I guess RCurl is not as robust to that as my web browser.

vobencha commented 7 years ago

Hi Michael,

This problem came up when Daniel was building the TxDbs for the upcoming release. To get around this for rheMac8 we can manually download the table and use makeTxDbPackage.

In the bigger picture, do you think this should be reported to UCSC? If yes, what would be a clean example to send them? Just a browser query of the refGene table for rheMac8?

Val

lawremi commented 7 years ago

You could just tell them about the issues with downloading RefSeq genes for rheMac8 via the table browser. I don't think there's a clean URL or anything for communicating that.

vobencha commented 7 years ago

OK thanks. @dvantwisk can you report this to UCSC?

dvantwisk commented 7 years ago

Sure, I'll tell them about this.

vobencha commented 7 years ago

@lawremi It's not clear how to report this problem to UCSC. rtracklayer uses RCurl and passes a cookie. When we just curl the URL and don't pass a cookie, it works. We want to report the problem in a Bioconductor-independent way, so that's probably just an example of a query through the UCSC browser web interface.

Daniel mentioned there are several interfaces for downloading these files from the UCSC browser - sorry I'm not more familiar with this. It would be helpful if you could send us the link to the browser interface where you observed the lag in download.

lawremi commented 7 years ago

This is the table browser I was referring to:

http://genome.ucsc.edu/cgi-bin/hgTables

vobencha commented 7 years ago

Thanks Michael. @dvantwisk can you reproduce the delay via this link? If yes, it should be enough for a bug report.

hpages commented 7 years ago

@lawremi Note that getTable() uses the following call to RCurl::getForm() internally to retrieve the table data:

library(RCurl)
url <- "http://genome.ucsc.edu/cgi-bin/hgTables"
.form <- list(
  hgta_group="allTracks",
  hgta_track=c(`RefSeq Genes`="refGene"),
  hgta_regionType="genome",
  hgta_table="refGene",
  hgta_outputType="primaryTable",
  hgta_compressType="none",
  hgta_doTopSubmit="get output",
  hgsid.value="610687863_GACcmloVspXbhMXNWq3oOCmyBznN"
)
.opts <- list(
  cookie="hguid=593208669_Ck1llY7d937McU6ABT0aKtvdExOS",
  useragent="rtracklayer"
)
out <- getForm(url, .params=.form, .opts=.opts)

As noted by Val and Daniel, what's getting in the way here is the cookie. If you just use:

.opts <- list(
  useragent="rtracklayer"
)

instead then the call to getForm() works. Is there any particular reason to use a cookie for downloading the table data? Thanks, H.

hpages commented 7 years ago

Of course this should not prevent us from reporting the problem to the UCSC folks (if we manage to reproduce it with their tools). Was just wondering how hard it would be to make getTable() more robust to these kinds of problems. H.
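
One generic way to make a fetch more tolerant of transient failures is a small retry wrapper. The sketch below is only an illustration and not how rtracklayer handles it: `with_retry` and its arguments are hypothetical names, and retrying would not help if the server fails the same way on every request.

```r
# Hypothetical helper (not part of rtracklayer or RCurl): call `fetch()`
# up to `attempts` times, pausing `wait` seconds between failed tries.
with_retry <- function(fetch, attempts = 3L, wait = 0) {
  last_error <- NULL
  for (i in seq_len(attempts)) {
    # Capture an error as a value instead of aborting the loop.
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    if (i < attempts) Sys.sleep(wait)
  }
  stop("all ", attempts, " attempts failed; last error: ",
       conditionMessage(last_error), call. = FALSE)
}

# Possible use with the query from the original report (needs network access):
# ucsc_txtable <- with_retry(function() getTable(query), attempts = 3L, wait = 30)
```

The fetch is passed as a function so that each attempt re-issues the request rather than re-reading a result that has already failed.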

lawremi commented 7 years ago

I don't know if the hguid is necessary or not. Perhaps it's only needed when working with custom (user uploaded) tracks. Are you sure that you're getting the right data (i.e., not human data) after removing the cookie? If so, then I guess there could be a way to disable the sending of the cookie in ucscGet(). Super weird that things break though, because the browser is obviously sending the cookie.

hpages commented 7 years ago

I think you can control the assembly by adding something like db="rheMac8" to .form. Also hgsid.value and hguid don't seem to be needed. With a simpler example (Assembly track for sacCer3):

library(RCurl)
url <- "http://genome.ucsc.edu/cgi-bin/hgTables"
.form <- list(
  db="sacCer3",
  hgta_group="map",
  hgta_track="gold",
  hgta_regionType="genome",
  hgta_table="gold",
  hgta_outputType="primaryTable",
  hgta_compressType="none",
  hgta_doTopSubmit="get output"
)
.opts <- list(
  useragent="rtracklayer"
)
out <- getForm(url, .params=.form, .opts=.opts)

Then:

> cat(out)
#bin    chrom   chromStart  chromEnd    ix  type    frag    fragStart   fragEnd strand
73  chrI    0   230218  1   O   BK006935.2  0   230218  +
585 chrM    0   85779   1   F   NC_001224   0   85779   +
73  chrV    0   576874  1   O   BK006939.2  0   576874  +
73  chrX    0   745751  1   O   BK006943.2  0   745751  +
73  chrII   0   813184  1   O   BK006936.2  0   813184  +
9   chrIV   0   1531933 1   O   BK006938.2  0   1531933 +
73  chrIX   0   439888  1   O   BK006942.2  0   439888  +
73  chrVI   0   270161  1   O   BK006940.2  0   270161  +
73  chrXI   0   666816  1   O   BK006944.2  0   666816  +
9   chrXV   0   1091291 1   O   BK006948.2  0   1091291 +
73  chrIII  0   316620  1   O   BK006937.2  0   316620  +
9   chrVII  0   1090940 1   O   BK006941.2  0   1090940 +
9   chrXII  0   1078177 1   O   BK006945.2  0   1078177 +
73  chrXIV  0   784333  1   O   BK006947.3  0   784333  +
73  chrXVI  0   948066  1   O   BK006949.2  0   948066  +
73  chrVIII 0   562643  1   O   BK006934.2  0   562643  +
73  chrXIII 0   924431  1   O   BK006946.2  0   924431  +

H.
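
To check what actually came back (for example, that the chromosome names belong to the expected assembly), the text returned by getForm() can be read into a data frame. This is a minimal sketch assuming the output is tab-separated with a header line starting with `#`, as in the listing above; `parse_ucsc_table` and `sample_out` are hypothetical names, and the sample reproduces only the first two rows.

```r
# Hypothetical sample mimicking the first lines of `out` (tab-separated).
sample_out <- paste(
  "#bin\tchrom\tchromStart\tchromEnd\tix\ttype\tfrag\tfragStart\tfragEnd\tstrand",
  "73\tchrI\t0\t230218\t1\tO\tBK006935.2\t0\t230218\t+",
  "585\tchrM\t0\t85779\t1\tF\tNC_001224\t0\t85779\t+",
  sep = "\n"
)

# Hypothetical helper: parse table browser text into a data.frame.
parse_ucsc_table <- function(txt) {
  df <- read.delim(text = txt, header = TRUE, comment.char = "",
                   check.names = FALSE, stringsAsFactors = FALSE)
  # The header line starts with "#", so strip it from the first column name.
  names(df) <- sub("^#", "", names(df))
  df
}

tbl <- parse_ucsc_table(sample_out)
unique(tbl$chrom)  # chromosome names present in the returned table
```

With the real result, `parse_ucsc_table(out)` would take the place of the sample string.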

lawremi commented 7 years ago

Yeah, but I am relying on UCSC to maintain that state (the genome) via the hgsid. The hgsid is used throughout the UCSC website, so it's simplest to just refer to that. The hguid might only be for custom tracks, so maybe we could drop it from that context.

hpages commented 7 years ago

Was a FWIW only. Always a good exercise to try to identify the "minimum requirements".

lawremi commented 7 years ago

I changed it to drop the cookie when interacting with the table browser. Seems to work for me now. Sorry it took so long.