Closed lcolladotor closed 9 years ago
As I said here, I think that with some docs on what the metadata columns mean, users could construct the GRanges
object with the necessary metadata. Then, using it, they can get the equivalent output to subject='hg19'
for their genome.
Alternatively, a function that takes a TxDb
object and creates a GRanges
object with the proper metadata would work too.
bumphunter:::known_transcripts() seems to process a TxDb UCSC knownGene package and returns a list in the same format as the built-in TT object. For example, to generate a transcript database from the more recent human GRCh38 using the TxDb.Hsapiens.UCSC.hg38.knownGene package:
> txHs38 = bumphunter:::known_transcripts(species = "Hsapiens", build="hg38")
One idea would be to allow the annotateNearest subject argument to accept these objects. i.e.:
> gr <- GRanges('chr1', IRanges(1, 5000))
> annotateNearest(gr, subject=txHs38)
Like lcolladotor suggests, we could also generalize bumphunter:::knowntranscripts to allow TxDb objects other than those in the TxDb..UCSC._.knownGene packages.
What do people think?
I think generalizing is the way to go. However, it should default to whatever we default now.
Supporting all TxDb objects makes total sense; it is probably easy. But I have not used this function for a long while, so I am not volunteering.
Best, Kasper
On Sat, Nov 22, 2014 at 1:29 PM, ririzarr notifications@github.com wrote:
I think generalizing is the way to go. However, it should default to whatever we default now.
— Reply to this email directly or view it on GitHub https://github.com/ririzarr/bumphunter/issues/1#issuecomment-64090216.
Instead of extending the annotation functionality of annotateNearest, we could consider instead adding a wrapper around existing annotation tools provided by other packages. For example, using VariantAnnotation with a standard TxDB object:
> library("VariantAnnotation")
> library("TxDb.Hsapiens.UCSC.hg19.knownGene")
> gr <- GRanges(c('chr10', 'chr10', 'chr3'), IRanges(c(131255354, 131265354, 37035288), c(131255554, 131265554, 37035368)))
> ann <- locateVariants(gr, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants())
> ann
GRanges object with 16 ranges and 9 metadata columns:
seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID GENEID
<Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <integer> <integer> <character>
[1] chr10 [131255354, 131255554] * | intergenic <NA> <NA> 1 <NA> <NA> <NA>
[2] chr10 [131265354, 131265554] + | promoter <NA> <NA> 2 39172 <NA> 4255
[3] chr3 [ 37035288, 37035368] + | intron 134 214 3 13445 <NA> 4292
[4] chr3 [ 37035288, 37035368] + | intron 134 214 3 13446 <NA> 4292
[5] chr3 [ 37035288, 37035368] + | intron 134 214 3 13447 <NA> 4292
... ... ... ... ... ... ... ... ... ... ... ...
[12] chr3 [37035288, 37035368] + | fiveUTR 21 101 3 13451 <NA> 4292
[13] chr3 [37035288, 37035368] + | promoter <NA> <NA> 3 13451 <NA> 4292
[14] chr3 [37035288, 37035368] + | fiveUTR 21 101 3 13452 <NA> 4292
[15] chr3 [37035288, 37035368] + | promoter <NA> <NA> 3 13452 <NA> 4292
[16] chr3 [37035288, 37035368] - | promoter <NA> <NA> 3 15577 <NA> 9852
PRECEDEID FOLLOWID
<CharacterList> <CharacterList>
[1] 100422873,10539,119437,...
[2]
[3]
[4]
[5]
... ... ...
[12]
[13]
[14]
[15]
[16]
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
We could add a simple wrapper that processes this annotated GRranges object to contain a single output range per input range, as annotateNearest does. An advantage of this approach is that we don't have to duplicate annotation functionality that exists elsewhere in Bioconductor.
Hi all. I have successfully created a dogTT object, similar to the TT object that comes with bumphunter. I'd love to be able to hand this object to annotateNearest() so that I can use the built-in annotation functions in derfinder. I'm trying to get a feel for what the status is on bumphunter's end. Can I help in any way? I'm relatively new to R but I have experience in other languages; I don't know bumphunter at all. So I can try to take a crack at doing some of this work, but I don't know whether I'm offering to get in over my head. Thanks, Jessica
So I took at look at the source for annotateNearest, and it appeared that it would do what I wanted without modifications. I built my transcripts object like so:
library(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
library(RMySQL)
library(bumphunter)
tx <- transcripts(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
names(tx) <- unlist(mcols(tx)[,2])
con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user="genome", dbname="canFam3")
refGene <- dbGetQuery(con, "select * from xenoRefGene;")
refGene <- refGene[match(names(tx), refGene$name),]
Nexons = sapply(strsplit(refGene$exonStarts, ","), length)
Exons <- IRangesList(start = lapply(strsplit(refGene$exonStarts, ","), function(x) x[x != "" & !is.na(x)]), end = lapply(strsplit(refGene$exonEnds, ","), function(x) x[x != "" & !is.na(x)]))
mcols(tx) <- DataFrame(CSS = refGene$cdsStart, CSE = refGene$cdsEnd, Tx = refGene$name, Gene = refGene$name2, Nexons = Nexons, Exons = Exons)
I annotated like so:
annotation <- annotateNearest(regions$regions, tx)
(where regions$regions is derfinder output -- I am happy to provide more info there if anyone is curious).
The results are somewhat annotated:
> head(annotation)
dist matchIndex type amountOverlap insideDist size1 size2
1 0 15454 inside NA 59 125 5819
2 0 10018 inside NA -1811 141 15559
3 0 10017 inside NA -34060 118 68867
4 0 13154 inside NA -22676 93 169910
5 0 876 inside NA -30307 59 157801
6 0 13167 inside NA 2095 114 12769
I am surprised that the annotation does not include the gene name or RefSeq ID, as the docs suggest it will. My suspicion is that I did not hand enough information along in my tx object, but the object does contain seqnames, and those weren't passed back in the annotation object.
> tx
GRanges object with 254017 ranges and 6 metadata columns:
seqnames ranges strand | CSS CSE
<Rle> <IRanges> <Rle> | <numeric> <numeric>
NM_011767 chr1 [363561, 368894] + | 75038343 75122798
NM_001011177 chr1 [363601, 367239] + | 363606 367239
NM_001090507 chr1 [363601, 367239] + | 363606 367239
NM_199558 chr1 [363604, 367325] + | 363671 367325
NM_001044283 chr1 [440183, 441303] + | 65252835 65294349
... ... ... ... ... ... ...
NM_001090227 chrX [123325049, 123390403] - | 123325048 123390403
NM_001102830 chrX [123325049, 123390403] - | 123325048 123390403
NM_001135068 chrX [123325049, 123390403] - | 123325048 123390403
NM_001012575 chrX [123325049, 123408056] - | 123325048 123408056
NM_001184797 chrX [123326032, 123438981] - | 123326060 123408176
Tx Gene Nexons
<character> <character> <integer>
NM_011767 NM_011767 Zfr 30
NM_001011177 NM_001011177 zfr 16
NM_001090507 NM_001090507 zfr 17
NM_199558 NM_199558 zfr 15
NM_001044283 NM_001044283 Snx3 10
... ... ... ...
NM_001090227 NM_001090227 MGC68497 6
NM_001102830 NM_001102830 tmlhe 7
NM_001135068 NM_001135068 tmlhe 7
NM_001012575 NM_001012575 TMLHE 7
NM_001184797 NM_001184797 TMLHE 7
Exons
<IRangesList>
NM_011767 [75038283, 75038380] [75038682, 75038775] [75056588, 75056878] ...
NM_001011177 [363600, 363660] [363682, 363706] [363729, 364308] ...
NM_001090507 [363600, 363660] [363682, 363706] [363729, 364308] ...
NM_199558 [363603, 363703] [363723, 364293] [364310, 364412] ...
NM_001044283 [65252221, 65252345] [65252377, 65252423] [65252439, 65252532] ...
... ...
NM_001090227 [123325048, 123325182] [123325850, 123325985] [123368412, 123368641] ...
NM_001102830 [123325048, 123325182] [123325850, 123325985] [123368412, 123368647] ...
NM_001135068 [123325048, 123325182] [123325850, 123325985] [123368412, 123368647] ...
NM_001012575 [123325048, 123325177] [123325851, 123325987] [123368414, 123368651] ...
NM_001184797 [123326031, 123326136] [123368413, 123368651] [123372057, 123372177] ...
-------
seqinfo: 3268 sequences (1 circular) from canFam3 genome
I would love advice about how to proceed -- is bumphunter behaving as expected (and it's my problem) or is this a bug?
Thanks, Jessica
Hi Jessica,
I think that it would be best if you make a separate issue for your specific scenario. This thread was meant to be about how to actually support other genomes. I know that your use case is related, but it'll be confusing to have both discussions in the same thread.
Cheers, Leo
As for the implementation, I don't know about the functions that @martinaryee mentioned from VariantAnnotation
. So I don't know if that's the way to go.
Using current functions, known_transcripts()
would need to be adapted to run code like in @jphekman's first code chunk: she is using a TxDb
object. As currently implemented, annotateNearest(subject)
has different behaviors if you supply a GRanges
object or a character vector as illustrated in https://support.bioconductor.org/p/62781/#62794. Hence why @jphekman's output doesn't match what is described in the docs.
I could give it a try to implement my suggestions. However, before doing so it would be great if there were unit tests for bumphunter
's functions so I can check that I'm not breaking other parts of it. I saw some tests for other parts of the package, but not the annotation-related functions at https://github.com/ririzarr/bumphunter/tree/master/inst/unitTests
Yeah, I debated about whether to add that information here. I moved it back to the bioconductor support forum as I don't think it's an actual bug. https://support.bioconductor.org/p/63231
Ah, thank you, Leo, I do see now that you had previously identified this issue with annotateNearest() not using the annotation when handed a transcripts object.
I made a bunch of changes. Now you can use any annotation package. it will break old code but the existing code needed to be changed.
Awesome! I'll change derfinder
and related packages soon to adapt to this version of bumphunter
.
Hi,
From answering https://support.bioconductor.org/p/62551/ I think that
bumphunter
should include some docs on how to useannotateNearest()
with an annotation other than hg19. Looking at the code https://github.com/ririzarr/bumphunter/blob/master/R/nearestgene.R#L24, I see that for hg19 you end up using a GRanges object that seems like others could create given some basic docs.At a quick glance, it looks similar to what you get from
TxDb
objects.Cheers, Leo