tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/

Conversion of IRanges to tibble strips metadata #55

Closed iimog closed 5 years ago

iimog commented 5 years ago

When converting a tibble to IRanges and back to a tibble, the metadata gets lost (whereas for GRanges it is retained):

> tibble(seqnames="a",start=1,end=2, meta="foo") %>% as_iranges %>% as_tibble
# A tibble: 1 x 3
  start   end width
  <int> <int> <int>
1     1     2     2
> tibble(seqnames="a",start=1,end=2, meta="foo") %>% as_granges %>% as_tibble
# A tibble: 1 x 6
  seqnames start   end width strand meta 
  <fct>    <int> <int> <int> <fct>  <chr>
1 a            1     2     2 *      foo  

My use case is to convert two tibbles to IRanges in order to merge their metadata on overlapping intervals. I then need to make some other modifications (e.g. replace_na), for which I want to convert the result back to a tibble. My current workaround is to overwrite the as_tibble method manually like this:

as_tibble.IRanges <- function(x){
    as_tibble(bind_cols(as.data.frame(x), as.data.frame(x@elementMetadata)))
}

This works for me, but it feels a little hacky and assumes that elementMetadata is always ordered identically to the ranges themselves. Is this a safe assumption? Converting to IRanges and back again is a rather common workflow for me, and it feels quite natural to do it as described above (with metadata included). Do you think this would be a good default behavior, or are there good reasons to drop the metadata?

I'm using plyranges 1.3.2 on R version 3.5.1, my sessionInfo():

```
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux buster/sid

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] plyranges_1.3.2      GenomicRanges_1.34.0 GenomeInfoDb_1.18.1
 [4] IRanges_2.16.0       S4Vectors_0.20.1     BiocGenerics_0.28.0
 [7] forcats_0.3.0        stringr_1.3.1        dplyr_0.7.8
[10] purrr_0.2.5          readr_1.2.1          tidyr_0.8.2
[13] tibble_1.4.2         ggplot2_3.1.0        tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0                  lubridate_1.7.4
 [3] lattice_0.20-38             Rsamtools_1.34.0
 [5] Biostrings_2.50.1           utf8_1.1.4
 [7] assertthat_0.2.0            R6_2.3.0
 [9] cellranger_1.1.0            plyr_1.8.4
[11] backports_1.1.2             httr_1.3.1
[13] pillar_1.3.0                zlibbioc_1.28.0
[15] rlang_0.3.0.1               curl_3.2
[17] lazyeval_0.2.1              readxl_1.1.0
[19] rstudioapi_0.8              Matrix_1.2-15
[21] BiocParallel_1.16.2         RCurl_1.95-4.11
[23] munsell_0.5.0               DelayedArray_0.8.0
[25] broom_0.5.0                 compiler_3.5.1
[27] modelr_0.1.2                rtracklayer_1.42.1
[29] pkgconfig_2.0.2             tidyselect_0.2.5
[31] SummarizedExperiment_1.12.0 GenomeInfoDbData_1.2.0
[33] matrixStats_0.54.0          XML_3.98-1.16
[35] fansi_0.4.0                 crayon_1.3.4
[37] withr_2.1.2                 GenomicAlignments_1.18.0
[39] bitops_1.0-6                grid_3.5.1
[41] nlme_3.1-137                jsonlite_1.5
[43] gtable_0.2.0                magrittr_1.5
[45] scales_1.0.0                cli_1.0.1
[47] stringi_1.2.4               XVector_0.22.0
[49] remotes_2.0.2               bindrcpp_0.2.2
[51] xml2_1.2.0                  tools_3.5.1
[53] Biobase_2.42.0              glue_1.3.0
[55] hms_0.4.2                   colorspace_1.3-2
[57] BiocManager_1.30.4          rvest_0.3.2
[59] bindr_0.1.1                 haven_2.0.0
```

sa-lee commented 5 years ago

I think the conversion error may be coming from the IRanges package rather than from plyranges. I suspect it is due to there not being an as.data.frame method for the IntegerRanges class. If we take the example from the IRanges constructor man page:

suppressPackageStartupMessages(library(IRanges))
ir1 <- IRanges(c(1, 10, 20), width=5)
mcols(ir1) <- DataFrame(score=runif(3))
as.data.frame(ir1)
#>   start end width
#> 1     1   5     5
#> 2    10  14     5
#> 3    20  24     5

We get the same issue. I think it'd be best to get this fixed on the IRanges end rather than implementing your own as_tibble method. I agree it's weird that the metadata gets silently dropped.
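In the meantime, an interim workaround could go through the public accessors (start(), end(), width(), mcols()) instead of the @elementMetadata slot. The following is only a rough sketch under a few assumptions: iranges_to_tibble is a hypothetical helper name (it is not part of plyranges or IRanges), the metadata columns are assumed not to be named start, end or width, and mcols() is assumed to hold one row per range in the same order as the ranges, which is how the S4Vectors documentation describes metadata columns.

```r
suppressPackageStartupMessages({
  library(IRanges)
  library(tibble)
})

# Hypothetical helper, not provided by plyranges or IRanges.
# Builds the range columns from the accessors and appends the
# metadata columns, if there are any.
iranges_to_tibble <- function(x) {
  ranges_df <- data.frame(start = start(x), end = end(x), width = width(x))
  meta <- mcols(x)  # may be NULL when no metadata columns are set
  if (is.null(meta) || ncol(meta) == 0L) {
    return(as_tibble(ranges_df))
  }
  # Assumption: mcols(x) has one row per range, in range order.
  as_tibble(cbind(ranges_df, as.data.frame(meta)))
}

# Usage: the metadata column survives the conversion.
ir <- IRanges(start = c(1, 10), end = c(2, 14))
mcols(ir) <- DataFrame(meta = c("foo", "bar"))
tbl <- iranges_to_tibble(ir)
```

Building the start/end/width columns explicitly, rather than calling as.data.frame(x), keeps the sketch independent of whether a given IRanges version includes the metadata in as.data.frame().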

cc @lawremi would you be able to fix this? Is there a reason for this behaviour?

iimog commented 5 years ago

Thanks for looking into this. I fully agree, I'd rather see this fixed upstream than using my own as_tibble method. I'm happy to help/test if I can.

sa-lee commented 5 years ago

Hi @iimog, I'm closing this as it's not relevant to plyranges. I would recommend posting an issue on Bioconductor/IRanges.