rafalab / bumphunter

bumphunter
15 stars 14 forks source link

Docs for supporting non-hg19 genomes #1

Closed lcolladotor closed 9 years ago

lcolladotor commented 9 years ago

Hi,

From answering https://support.bioconductor.org/p/62551/ I think that bumphunter should include some docs on how to use annotateNearest() with an annotation other than hg19. Looking at the code https://github.com/ririzarr/bumphunter/blob/master/R/nearestgene.R#L24, I see that for hg19 you end up using a GRanges object that seems like others could create given some basic docs.

> library(bumphunter)
> data(TT, package = "bumphunter", envir = environment())
> TT$transcripts
GRanges object with 73432 ranges and 8 metadata columns:
             seqnames               ranges strand   |       CSS       CSE
                <Rle>            <IRanges>  <Rle>   | <integer> <integer>
  uc002qsd.4    chr19 [58858172, 58864865]      -   |  58858388  58864803
  uc002qsf.2    chr19 [58859832, 58874214]      -   |      <NA>      <NA>
  uc003wyw.1     chr8 [18248755, 18258723]      +   |  18257514  18258386
  uc002xmj.3    chr20 [43248163, 43280376]      -   |  43248475  43280248
  uc010xbn.1    chr18 [25530930, 25616539]      -   |  25532117  25616512
         ...      ...                  ...    ... ...       ...       ...
  uc011dzz.2     chr6 [90539619, 90584155]      +   |  90556309  90583528
  uc003bma.3    chr22 [50961997, 50964033]      -   |  50962040  50962840
  uc003blz.4    chr22 [50961997, 50964034]      -   |  50962040  50962840
  uc021wrz.1    chr22 [50961997, 50964570]      -   |  50962040  50962840
  uc021wsa.1    chr22 [50961997, 50964905]      -   |  50962040  50962840
                      Tx Entrez     Gene
             <character>  <Rle>    <Rle>
  uc002qsd.4  uc002qsd.4      1     A1BG
  uc002qsf.2  uc002qsf.2      1     A1BG
  uc003wyw.1  uc003wyw.1     10     NAT2
  uc002xmj.3  uc002xmj.3    100      ADA
  uc010xbn.1  uc010xbn.1   1000     CDH2
         ...         ...    ...      ...
  uc011dzz.2  uc011dzz.2   9994 CASP8AP2
  uc003bma.3  uc003bma.3   9997     SCO2
  uc003blz.4  uc003blz.4   9997     SCO2
  uc021wrz.1  uc021wrz.1   9997     SCO2
  uc021wsa.1  uc021wsa.1   9997     SCO2
                                                                                                        Refseq
                                                                                                         <Rle>
  uc002qsd.4                                                                               NM_130786 NP_570602
  uc002qsf.2                                                                               NM_130786 NP_570602
  uc003wyw.1                                                                               NM_000015 NP_000006
  uc002xmj.3                 NM_000022 NP_000013 XM_005260236 XM_006723679 XP_005260293 XP_006723742 XR_244129
  uc010xbn.1                           NM_001792 NP_001783 XM_005258181 XM_005258182 XP_005258238 XP_005258239
         ...                                                                                               ...
  uc011dzz.2                           NM_001137667 NM_001137668 NM_012115 NP_001131139 NP_001131140 NP_036247
  uc003bma.3 NM_001169109 NM_001169110 NM_001169111 NM_005138 NP_001162580 NP_001162581 NP_001162582 NP_005129
  uc003blz.4 NM_001169109 NM_001169110 NM_001169111 NM_005138 NP_001162580 NP_001162581 NP_001162582 NP_005129
  uc021wrz.1 NM_001169109 NM_001169110 NM_001169111 NM_005138 NP_001162580 NP_001162581 NP_001162582 NP_005129
  uc021wsa.1 NM_001169109 NM_001169110 NM_001169111 NM_005138 NP_001162580 NP_001162581 NP_001162582 NP_005129
                Nexons
             <integer>
  uc002qsd.4         8
  uc002qsf.2        11
  uc003wyw.1         2
  uc002xmj.3        12
  uc010xbn.1        15
         ...       ...
  uc011dzz.2        10
  uc003bma.3         2
  uc003blz.4         2
  uc021wrz.1         2
  uc021wsa.1         2
                                                                          Exons
                                                                  <IRangesList>
  uc002qsd.4 [58858172, 58858395] [58858719, 58859006] [58861736, 58862017] ...
  uc002qsf.2 [58859832, 58860494] [58860934, 58862017] [58862757, 58863053] ...
  uc003wyw.1                          [18248755, 18248855] [18257508, 18258723]
  uc002xmj.3 [43248163, 43248488] [43248940, 43249042] [43249659, 43249788] ...
  uc010xbn.1 [25530930, 25532323] [25543321, 25543485] [25562908, 25563047] ...
         ...                                                                ...
  uc011dzz.2 [90539619, 90539653] [90556281, 90556363] [90562885, 90562953] ...
  uc003bma.3                          [50961997, 50962853] [50963871, 50964033]
  uc003blz.4                          [50961997, 50962853] [50963901, 50964034]
  uc021wrz.1                          [50961997, 50962853] [50964430, 50964570]
  uc021wsa.1                          [50961997, 50962853] [50964675, 50964905]
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
> packageVersion('bumphunter')
[1] ‘1.6.0’

At a quick glance, it looks similar to what you get from TxDb objects.

Cheers, Leo

lcolladotor commented 9 years ago

See also https://support.bioconductor.org/p/62781/

lcolladotor commented 9 years ago

As I said here, I think that with some docs on what the metadata columns mean, users could construct the GRanges object with the necessary metadata. Then, using it, they can get the equivalent output to subject='hg19' for their genome.

Alternatively, a function that takes a TxDb object and creates a GRanges object with the proper metadata would work too.

martinaryee commented 9 years ago

bumphunter:::known_transcripts() seems to process a TxDb UCSC knownGene package and returns a list in the same format as the built-in TT object. For example, to generate a transcript database from the more recent human GRCh38 using the TxDb.Hsapiens.UCSC.hg38.knownGene package:

> txHs38 = bumphunter:::known_transcripts(species = "Hsapiens", build="hg38")

One idea would be to allow the annotateNearest subject argument to accept these objects. i.e.:

> gr <- GRanges('chr1', IRanges(1, 5000))
> annotateNearest(gr, subject=txHs38)

Like lcolladotor suggests, we could also generalize bumphunter:::knowntranscripts to allow TxDb objects other than those in the TxDb..UCSC._.knownGene packages.

What do people think?

rafalab commented 9 years ago

I think generalizing is the way to go. However, it should default to whatever we default now.

kasperdanielhansen commented 9 years ago

Supporting all TxDb objects makes total sense; it is probably easy. But I have not used this function for a long while, so I am not volunteering.

Best, Kasper

On Sat, Nov 22, 2014 at 1:29 PM, ririzarr notifications@github.com wrote:

I think generalizing is the way to go. However, it should default to whatever we default now.

— Reply to this email directly or view it on GitHub https://github.com/ririzarr/bumphunter/issues/1#issuecomment-64090216.

martinaryee commented 9 years ago

Instead of extending the annotation functionality of annotateNearest, we could consider instead adding a wrapper around existing annotation tools provided by other packages. For example, using VariantAnnotation with a standard TxDB object:

> library("VariantAnnotation")
> library("TxDb.Hsapiens.UCSC.hg19.knownGene")
> gr <- GRanges(c('chr10', 'chr10', 'chr3'), IRanges(c(131255354, 131265354, 37035288), c(131255554, 131265554, 37035368)))
> ann <- locateVariants(gr, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants())
> ann
GRanges object with 16 ranges and 9 metadata columns:
       seqnames                 ranges strand   |   LOCATION  LOCSTART    LOCEND   QUERYID      TXID     CDSID      GENEID
          <Rle>              <IRanges>  <Rle>   |   <factor> <integer> <integer> <integer> <integer> <integer> <character>
   [1]    chr10 [131255354, 131255554]      *   | intergenic      <NA>      <NA>         1      <NA>      <NA>        <NA>
   [2]    chr10 [131265354, 131265554]      +   |   promoter      <NA>      <NA>         2     39172      <NA>        4255
   [3]     chr3 [ 37035288,  37035368]      +   |     intron       134       214         3     13445      <NA>        4292
   [4]     chr3 [ 37035288,  37035368]      +   |     intron       134       214         3     13446      <NA>        4292
   [5]     chr3 [ 37035288,  37035368]      +   |     intron       134       214         3     13447      <NA>        4292
   ...      ...                    ...    ... ...        ...       ...       ...       ...       ...       ...         ...
  [12]     chr3   [37035288, 37035368]      +   |    fiveUTR        21       101         3     13451      <NA>        4292
  [13]     chr3   [37035288, 37035368]      +   |   promoter      <NA>      <NA>         3     13451      <NA>        4292
  [14]     chr3   [37035288, 37035368]      +   |    fiveUTR        21       101         3     13452      <NA>        4292
  [15]     chr3   [37035288, 37035368]      +   |   promoter      <NA>      <NA>         3     13452      <NA>        4292
  [16]     chr3   [37035288, 37035368]      -   |   promoter      <NA>      <NA>         3     15577      <NA>        9852
                        PRECEDEID        FOLLOWID
                  <CharacterList> <CharacterList>
   [1] 100422873,10539,119437,...                
   [2]                                           
   [3]                                           
   [4]                                           
   [5]                                           
   ...                        ...             ...
  [12]                                           
  [13]                                           
  [14]                                           
  [15]                                           
  [16]                                           
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

We could add a simple wrapper that processes this annotated GRranges object to contain a single output range per input range, as annotateNearest does. An advantage of this approach is that we don't have to duplicate annotation functionality that exists elsewhere in Bioconductor.

jphekman commented 9 years ago

Hi all. I have successfully created a dogTT object, similar to the TT object that comes with bumphunter. I'd love to be able to hand this object to annotateNearest() so that I can use the built-in annotation functions in derfinder. I'm trying to get a feel for what the status is on bumphunter's end. Can I help in any way? I'm relatively new to R but I have experience in other languages; I don't know bumphunter at all. So I can try to take a crack at doing some of this work, but I don't know whether I'm offering to get in over my head. Thanks, Jessica

jphekman commented 9 years ago

So I took at look at the source for annotateNearest, and it appeared that it would do what I wanted without modifications. I built my transcripts object like so:

library(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
library(RMySQL)
library(bumphunter)
tx <- transcripts(TxDb.Cfamiliaris.UCSC.canFam3.xenoRefGene)
names(tx) <- unlist(mcols(tx)[,2])

con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user="genome", dbname="canFam3")
refGene <- dbGetQuery(con, "select * from xenoRefGene;")
refGene <- refGene[match(names(tx), refGene$name),]
Nexons = sapply(strsplit(refGene$exonStarts, ","), length)
Exons <- IRangesList(start = lapply(strsplit(refGene$exonStarts, ","), function(x) x[x != "" & !is.na(x)]), end = lapply(strsplit(refGene$exonEnds, ","), function(x) x[x != "" & !is.na(x)]))
mcols(tx) <- DataFrame(CSS = refGene$cdsStart, CSE = refGene$cdsEnd, Tx = refGene$name, Gene = refGene$name2, Nexons = Nexons, Exons = Exons)

I annotated like so:

annotation <- annotateNearest(regions$regions, tx)

(where regions$regions is derfinder output -- I am happy to provide more info there if anyone is curious).

The results are somewhat annotated:

> head(annotation)
dist matchIndex   type amountOverlap insideDist size1  size2
1    0      15454 inside            NA         59   125   5819
2    0      10018 inside            NA      -1811   141  15559
3    0      10017 inside            NA     -34060   118  68867
4    0      13154 inside            NA     -22676    93 169910
5    0        876 inside            NA     -30307    59 157801
6    0      13167 inside            NA       2095   114  12769

I am surprised that the annotation does not include the gene name or RefSeq ID, as the docs suggest it will. My suspicion is that I did not hand enough information along in my tx object, but the object does contain seqnames, and those weren't passed back in the annotation object.

> tx
GRanges object with 254017 ranges and 6 metadata columns:
               seqnames                 ranges strand   |       CSS       CSE
                  <Rle>              <IRanges>  <Rle>   | <numeric> <numeric>
     NM_011767     chr1       [363561, 368894]      +   |  75038343  75122798
  NM_001011177     chr1       [363601, 367239]      +   |    363606    367239
  NM_001090507     chr1       [363601, 367239]      +   |    363606    367239
     NM_199558     chr1       [363604, 367325]      +   |    363671    367325
  NM_001044283     chr1       [440183, 441303]      +   |  65252835  65294349
           ...      ...                    ...    ... ...       ...       ...
  NM_001090227     chrX [123325049, 123390403]      -   | 123325048 123390403
  NM_001102830     chrX [123325049, 123390403]      -   | 123325048 123390403
  NM_001135068     chrX [123325049, 123390403]      -   | 123325048 123390403
  NM_001012575     chrX [123325049, 123408056]      -   | 123325048 123408056
  NM_001184797     chrX [123326032, 123438981]      -   | 123326060 123408176
                         Tx        Gene    Nexons
                <character> <character> <integer>
     NM_011767    NM_011767         Zfr        30
  NM_001011177 NM_001011177         zfr        16
  NM_001090507 NM_001090507         zfr        17
     NM_199558    NM_199558         zfr        15
  NM_001044283 NM_001044283        Snx3        10
           ...          ...         ...       ...
  NM_001090227 NM_001090227    MGC68497         6
  NM_001102830 NM_001102830       tmlhe         7
  NM_001135068 NM_001135068       tmlhe         7
  NM_001012575 NM_001012575       TMLHE         7
  NM_001184797 NM_001184797       TMLHE         7
                                                                                  Exons
                                                                          <IRangesList>
     NM_011767       [75038283, 75038380] [75038682, 75038775] [75056588, 75056878] ...
  NM_001011177                   [363600, 363660] [363682, 363706] [363729, 364308] ...
  NM_001090507                   [363600, 363660] [363682, 363706] [363729, 364308] ...
     NM_199558                   [363603, 363703] [363723, 364293] [364310, 364412] ...
  NM_001044283       [65252221, 65252345] [65252377, 65252423] [65252439, 65252532] ...
           ...                                                                      ...
  NM_001090227 [123325048, 123325182] [123325850, 123325985] [123368412, 123368641] ...
  NM_001102830 [123325048, 123325182] [123325850, 123325985] [123368412, 123368647] ...
  NM_001135068 [123325048, 123325182] [123325850, 123325985] [123368412, 123368647] ...
  NM_001012575 [123325048, 123325177] [123325851, 123325987] [123368414, 123368651] ...
  NM_001184797 [123326031, 123326136] [123368413, 123368651] [123372057, 123372177] ...
  -------
  seqinfo: 3268 sequences (1 circular) from canFam3 genome

I would love advice about how to proceed -- is bumphunter behaving as expected (and it's my problem) or is this a bug?

Thanks, Jessica

lcolladotor commented 9 years ago

Hi Jessica,

I think that it would be best if you make a separate issue for your specific scenario. This thread was meant to be about how to actually support other genomes. I know that your use case is related, but it'll be confusing to have both discussions in the same thread.

Cheers, Leo

lcolladotor commented 9 years ago

As for the implementation, I don't know about the functions that @martinaryee mentioned from VariantAnnotation. So I don't know if that's the way to go.

Using current functions, known_transcripts() would need to be adapted to run code like in @jphekman's first code chunk: she is using a TxDb object. As currently implemented, annotateNearest(subject) has different behaviors if you supply a GRanges object or a character vector as illustrated in https://support.bioconductor.org/p/62781/#62794. Hence why @jphekman's output doesn't match what is described in the docs.

I could give it a try to implement my suggestions. However, before doing so it would be great if there were unit tests for bumphunter's functions so I can check that I'm not breaking other parts of it. I saw some tests for other parts of the package, but not the annotation-related functions at https://github.com/ririzarr/bumphunter/tree/master/inst/unitTests

jphekman commented 9 years ago

Yeah, I debated about whether to add that information here. I moved it back to the bioconductor support forum as I don't think it's an actual bug. https://support.bioconductor.org/p/63231

jphekman commented 9 years ago

Ah, thank you, Leo, I do see now that you had previously identified this issue with annotateNearest() not using the annotation when handed a transcripts object.

rafalab commented 9 years ago

I made a bunch of changes. Now you can use any annotation package. it will break old code but the existing code needed to be changed.

lcolladotor commented 9 years ago

Awesome! I'll change derfinder and related packages soon to adapt to this version of bumphunter.