lawremi / rtracklayer

R interface to genome annotation files and the UCSC genome browser
Other
26 stars 16 forks source link

getTable ignores query ranges #85

Closed dorkylever closed 1 year ago

dorkylever commented 1 year ago

Hi All,

I have a set of mouse SNPs (~974) from GRCm39 that I'm trying to get conservation scores on. Ultimately I would like to get the GERP conservation score files available from the the Ensembl FTP site, import the BigWig track into rtracklayer and retrieve the scores for each SNP. Just for testing, however, I'm using the multiz35way conservation score as it's pre-built.

However, when I do this, the ranges are ignored and it returns just max possible entries for chromosome one. library(rtracklayer) session <- browserSession("UCSC") genome(session) <- "mm39" clone_for_snp_gr <- snps_filtered_from_FVB clone_for_snp_gr$chr_name <- unlist(clone_for_snp_gr$chr_name) clone_for_snp_gr$chrom_start <- unlist(clone_for_snp_gr$chrom_start) clone_for_snp_gr$chrom_end <- unlist(clone_for_snp_gr$chrom_end) snp_gr <-makeGRangesFromDataFrame(clone_for_snp_gr, seqnames.field = "chr_name", start.field = "chrom_start", end.field = "chrom_end") snp_gr <- renameSeqlevels(snp_gr, paste0("chr", seqlevels(snp_gr))) for_query <- GRangesForUCSCGenome("mm39", chrom = snp_gr@seqnames, ranges = snp_gr@ranges) for_query GRanges object with 974 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr9 119800452 * [2] chr7 44199769 * [3] chr7 132616737 * [4] chr4 136275390 * [5] chr4 136275455 * ... ... ... ... [970] chr9 120868329 * [971] chr9 120868349 * [972] chr2 165916862 * [973] chr9 71837860 * [974] chrX 71962690 *

seqinfo: 61 sequences (1 circular) from mm39 genome test <- ucscTableQuery(session, track="cons35way", range=for_query, table="multiz35way") con_scores <- getTable(test) con_scores bin chrom chromStart chromEnd extFile offset score 1 0 chr1 67108799 67108920 1 3567081335 0.501910 2 0 chr1 134217621 134217740 1 7240893127 0.886047 3 1 chr1 8387705 8389370 1 172116144 0.510487 4 1 chr1 16777183 16777248 1 618514000 0.501355 5 1 chr1 25165823 25165834 1 1118818193 0.727571 6 1 chr1 33554431 33554447 1 1422195359 0.547642 7 1 chr1 41942996 41943218 1 1932068494 0.501428 8 1 chr1 50331618 50332996 1 2328581995 0.500599 9 1 chr1 58719908 58722745 1 2933344225 0.500000 10 2 chr1 75497420 75497476 1 4222662685 0.499302 11 2 chr1 83886033 83886164 1 4835125740 0.502277 12 2 chr1 92274310 92275035 1 5366193570 0.502864 13 2 chr1 100663007 100663507 1 5709210931 0.503867 14 2 chr1 109049041 109054290 1 6051152786 0.500000 15 2 chr1 117440159 117440543 1 6238167617 0.500000 16 2 chr1 125829054 125829155 1 6626874554 0.500635 [ reached 'max' / getOption("max.print") -- omitted 999858 rows ]`

Kind Regards, Kyle

dorkylever commented 1 year ago

Sorry for the terrible formatting

It's the same issue as https://www.biostars.org/p/9552594/

dorkylever commented 1 year ago

This seems independent of the query:

gr <- GRanges(seqnames = c("chr1", "chr1", "chr2", "chr2"), ranges = IRanges(start = c(100, 200, 300, 400), end = c(150, 250, 350, 450)), strand = c("+", "-", "+", "-"))

test <- ucscTableQuery(session, track = "cons35way", range = gr, table = "multiz35way") ans <- rtracklayer::getTable(test)

str(ans) 'data.frame': 1000000 obs. of 7 variables: $ bin : num 0 0 1 1 1 1 1 1 1 2 ... $ chrom : chr "chr1" "chr1" "chr1" "chr1" ... $ chromStart: num 6.71e+07 1.34e+08 8.39e+06 1.68e+07 2.52e+07 ... $ chromEnd : num 6.71e+07 1.34e+08 8.39e+06 1.68e+07 2.52e+07 ... $ extFile : num 1 1 1 1 1 1 1 1 1 1 ... $ offset : num 3.57e+09 7.24e+09 1.72e+08 6.19e+08 1.12e+09 ... $ score : num 0.502 0.886 0.51 0.501 0.728 ...

I don't understand

dorkylever commented 1 year ago

Wait, can the Grange object only have a single value?

i.e., something like: GRanges("chr1", IRanges(start = 38561922, end = 38561922), strand = "-")

dorkylever commented 1 year ago

So I'm having to do an sapply for multiple queries like such:

sapply(seq_along(for_query), function(i) { Sys.sleep(3) getTable(ucscTableQuery(mySession, track="cons35way",range=for_query[i], table="multiz35way"))})

but this seems inconventient and you get errorHandler(responseError) : Too Many Requests

dorkylever commented 1 year ago

nevermind there's myRSQL that I should probably be using. I just got baited by the fact that the input was a Granges objects, thinking it should have been for more high-throughput-.

dorkylever commented 1 year ago

For anybosy else that gets stuck on getting GERP scores in the void - look at this post:

https://www.biostars.org/p/9558532/#9558724