tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/
137 stars 19 forks source link

[BUG] Add min.width argument for reduce_ranges #84

Open whtns opened 4 years ago

whtns commented 4 years ago

Context

I'd like to use reduce_ranges to calculate grouped summary stats. This is difficult with base granges tools. I also want to specify a minimum width argument of 0 bases.

Code

Include the code you ran and comments

library(tibble)
library(plyranges)

# setup granges
start_granges <- tibble(seqnames = rep(1, 8),
                       start = rep(c(1, 2, 11, 22), 2),
                       end = rep(c(10, 8, 20, 30), 2),
                       coverage = rep(c(10, 10, 20, 30), 2),
                       sample_id = c(rep("A", 4), rep("B", 4))) %>% 
  as_granges()

# summarize coverage by sample id with min.width = 1
ply_reduced_granges <- 
  start_granges %>% 
  group_by(sample_id) %>% 
  plyranges::reduce_ranges(coverage = sum(coverage))

# summarize coverage by sample id with min.width = 0
bioc_reduced_granges <- 
  start_granges %>% 
  split(.$sample_id) %>% 
  GRangesList()

bioc_reduced_granges <- lapply(bioc_reduced_granges, reduce, min.gapwidth = 0L, with.revmap = TRUE)

# this results in a complex granges object that makes summary stat calculations difficult

R session information

Remember to include your full R session information.

─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────── setting value
version R version 3.6.3 (2020-02-29) os Ubuntu 16.04.6 LTS
system x86_64, linux-gnu
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2020-07-16

─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── package version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
backports 1.1.8 2020-06-17 [1] CRAN (R 3.6.3)
Biobase 2.44.0 2019-05-02 [1] Bioconductor
BiocGenerics
0.30.0 2019-05-02 [1] Bioconductor
BiocParallel 1.18.1 2019-08-06 [1] Bioconductor
Biostrings 2.52.0 2019-05-02 [1] Bioconductor
bitops 1.0-6 2013-08-17 [1] CRAN (R 3.6.0)
callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.3)
cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.2)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
DelayedArray 0.10.0 2019-05-02 [1] Bioconductor
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
devtools 2.3.0 2020-04-10 [1] CRAN (R 3.6.3)
digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.2)
dplyr 1.0.0 2020-05-29 [1] CRAN (R 3.6.3)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.3)
fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
fs 1.4.1 2020-04-04 [1] CRAN (R 3.6.3)
generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
GenomeInfoDb
1.20.0 2019-05-02 [1] Bioconductor
GenomeInfoDbData 1.2.1 2020-05-06 [1] Bioconductor
GenomicAlignments 1.20.1 2019-06-18 [1] Bioconductor
GenomicRanges 1.36.1 2019-09-06 [1] Bioconductor
glue 1.4.1 2020-05-13 [1] CRAN (R 3.6.3)
googledrive
1.0.1 2020-05-05 [1] CRAN (R 3.6.3)
IRanges 2.18.3 2019-09-24 [1] Bioconductor
lattice 0.20-41 2020-04-02 [3] CRAN (R 3.6.3)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
MASS
7.3-51.6 2020-04-26 [3] CRAN (R 3.6.3)
Matrix 1.2-18 2019-11-27 [3] CRAN (R 3.6.1)
matrixStats 0.56.0 2020-03-13 [1] CRAN (R 3.6.3)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.3)
packrat 0.5.0 2018-11-14 [1] CRAN (R 3.6.0)
pillar 1.4.4 2020-05-05 [1] CRAN (R 3.6.3)
pkgbuild 1.0.8.9000 2020-07-01 [1] Github (r-lib/pkgbuild@ed57598) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
pkgload 1.1.0 2020-05-29 [1] CRAN (R 3.6.3)
plyranges 1.4.4 2019-09-17 [1] Bioconductor
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.0)
processx 3.4.2 2020-02-09 [1] CRAN (R 3.6.2)
ps 1.3.3 2020-05-08 [1] CRAN (R 3.6.3)
purrr 0.3.4 2020-04-17 [1] CRAN (R 3.6.3)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 3.6.3)
RCurl 1.98-1.2 2020-04-18 [1] CRAN (R 3.6.3)
remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.2)
rlang 0.4.6 2020-05-02 [1] CRAN (R 3.6.3)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
Rsamtools 2.0.3 2019-10-10 [1] Bioconductor
rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.2)
rtracklayer 1.44.4 2019-09-06 [1] Bioconductor
S4Vectors
0.22.1 2019-09-09 [1] Bioconductor
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
SummarizedExperiment 1.14.1 2019-07-31 [1] Bioconductor
testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.2)
tibble 3.0.1 2020-04-20 [1] CRAN (R 3.6.3)
tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3)
usethis
1.6.1 2020-04-29 [1] CRAN (R 3.6.3)
vctrs 0.3.1 2020-06-05 [1] CRAN (R 3.6.3)
withr 2.2.0 2020-04-20 [1] CRAN (R 3.6.3)
XML 3.99-0.3 2020-01-20 [1] CRAN (R 3.6.0)
XVector 0.24.0 2019-05-02 [1] Bioconductor
zlibbioc 1.30.0 2019-05-02 [1] Bioconductor

[1] /usr/local/lib/R/site-library [2] /usr/lib/R/site-library [3] /usr/lib/R/library

The output of sessioninfo::session_info() includes relevant GitHub installation information and other details that are missed by sessionInfo().

sa-lee commented 4 years ago

Thanks, so I definitely think we should include an option for this - but I'm a bit swamped with other things so it might be a while to look at it.

My initial thinking was users could do interval arithmetic, in your case you could emulate the gapwidth argument like this:

library(tibble)
library(plyranges)

# setup granges
start_granges <- tibble(seqnames = rep(1, 8),
                        start = rep(c(1, 2, 11, 22), 2),
                        end = rep(c(10, 8, 20, 30), 2),
                        coverage = rep(c(10, 10, 20, 30), 2),
                        sample_id = c(rep("A", 4), rep("B", 4))) %>% 
  as_granges()

start_granges %>% 
  mutate(width = width - 1) %>% # moves width but leaves start fixed, can be made explict with anchor_start()
  group_by(sample_id) %>% 
  reduce_ranges(coverage = sum(coverage)) %>% 
  mutate(end = end + 1)  # shift the end back
#> GRanges object with 6 ranges and 2 metadata columns:
#>       seqnames    ranges strand |   sample_id  coverage
#>          <Rle> <IRanges>  <Rle> | <character> <numeric>
#>   [1]        1      1-10      * |           A        20
#>   [2]        1     11-20      * |           A        20
#>   [3]        1     22-30      * |           A        30
#>   [4]        1      1-10      * |           B        20
#>   [5]        1     11-20      * |           B        20
#>   [6]        1     22-30      * |           B        30
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
whtns commented 4 years ago

Thanks! that will work for the moment, I think. I'll try to PR if I get the chance. Do you think adding this option would impact a lot of other parts of plyranges?

sa-lee commented 4 years ago

I don't think so - I think it just requires updating the reduce_ranges signature to allow for additional arguments to reduce via .... I should have some time in the next few days to look at it.

mikelove commented 10 months ago

I took a shot at this here:

https://github.com/mikelove/plyranges/commit/a0dd1aba57ade82c2624665d5f97d895605b93f4

If this looks alright, I can add an example and tests.

> x <- data.frame(seqnames="1", start=c(1,11,21), width=c(5,10,5)) %>% as_granges()
> x
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1       1-5      *
  [2]        1     11-20      *
  [3]        1     21-25      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> x %>% reduce_ranges(min.gapwidth=5L)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1       1-5      *
  [2]        1     11-25      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> x %>% reduce_ranges(min.gapwidth=6L)
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1      1-25      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths