tidyomics / plyranges

A grammar of genomic data transformation
https://tidyomics.github.io/plyranges/

[BUG] CharacterList columns cause an error in join_overlap_left() #91

Open smped opened 3 years ago

smped commented 3 years ago

Hi Stuart,

Hope things are going well & I'm still finding this to be such a useful package.

I've come across a problem with join_overlap_left() when the right-hand ranges contain a CharacterList column, as might be output from reduce_ranges() depending on the function being used. If there is a CharacterList column, the function simply fails with the error:

Error: subscript contains NAs

As a minimal reproducible example:

library(plyranges)
x <- GRanges(c("chr1:1-10", "chr1:21-30")) 
y <- GRanges("chr1:25-30") %>% mutate(letter = CharacterList("a"))
join_overlap_left(x, y)
Error: subscript contains NAs

This produces the above error. However, the same error doesn't occur when the column is a generic S3 list:

y$letter <- as(y$letter, "list") 
join_overlap_left(x, y)

GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand | letter
         <Rle> <IRanges>  <Rle> | <list>
  [1]     chr1      1-10      * |       
  [2]     chr1     21-30      * |      a
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
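
The same workaround can be applied to every CharacterList column at once. The following is only a minimal sketch: `delist_cols()` is a hypothetical helper name, not part of plyranges, and it assumes the behaviour shown above (list columns join cleanly) holds for the installed versions.

```r
library(plyranges)

# Hypothetical helper (not part of plyranges): convert every CharacterList
# metadata column to a plain S3 list so join_overlap_left() can pad it.
delist_cols <- function(gr) {
  for (nm in names(mcols(gr))) {
    col <- mcols(gr)[[nm]]
    if (is(col, "CharacterList")) {
      mcols(gr)[[nm]] <- as(col, "list")
    }
  }
  gr
}

x <- GRanges(c("chr1:1-10", "chr1:21-30"))
y <- GRanges("chr1:25-30")
y$letter <- CharacterList("a")

# Convert the right-hand ranges before joining
joined <- join_overlap_left(x, delist_cols(y))
```

Columns of other types are left untouched, so the helper should be safe to call on ranges that have no CharacterList columns at all.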

I also noticed that I couldn't find a way to change the original CharacterList column into an S3 list using mutate(), but that might be a side issue.

y <- GRanges("chr1:25-30") %>% mutate(letter = CharacterList("a"))
mutate(y, letter = as(letter, "list"))

GRanges object with 1 range and 1 metadata column:
      seqnames    ranges strand |      letter
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1     25-30      * |           a
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

This shouldn't (to my mind) produce a character column, but should return a list column. If the object is more complicated than my toy example, it can cause the data to fall apart pretty badly.

y <- GRanges(c("chr1:25-30", "chr1:101")) %>% 
  mutate(letter = CharacterList(list("a", c("b", "c")))) 
y %>%
  mutate(letter = as(letter, "list"))

GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand |      letter
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1     25-30      * |           a
  [2]     chr1       101      * |           a
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
Warning message:
In recycleSingleBracketReplacementValue(value, x, nsbs) :
  number of values supplied is not a sub-multiple of the number of values to be replaced
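
Until mutate() handles this conversion, direct assignment with `$<-` (the form used earlier in this report) appears to be the safer route for the two-range example. This is a sketch only, assuming the same package versions as in the session information.

```r
library(plyranges)

y <- GRanges(c("chr1:25-30", "chr1:101"))
y$letter <- CharacterList(list("a", c("b", "c")))

# Assign the converted column directly instead of going through mutate()
y$letter <- as(y$letter, "list")

# Each range should keep its own elements: "a" and c("b", "c")
y$letter
```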

Hopefully that's not too much information.

Cheers,

Steve

R session information

─ Session info ──────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.1.0 (2021-05-18)
 os       Ubuntu 20.04.2 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  C.UTF-8                     
 ctype    C.UTF-8                     
 tz       Australia/Adelaide          
 date     2021-07-02                  

─ Packages ──────────────────────────────────────────────────────────────────
 package              * version  date       lib source        
 assertthat             0.2.1    2019-03-21 [1] CRAN (R 4.1.0)
 Biobase                2.52.0   2021-05-19 [1] Bioconductor  
 BiocGenerics         * 0.38.0   2021-05-19 [1] Bioconductor  
 BiocIO                 1.2.0    2021-05-19 [1] Bioconductor  
 BiocManager            1.30.16  2021-06-15 [1] CRAN (R 4.1.0)
 BiocParallel           1.26.0   2021-05-19 [1] Bioconductor  
 Biostrings             2.60.1   2021-06-06 [1] Bioconductor  
 bitops                 1.0-7    2021-04-24 [1] CRAN (R 4.1.0)
 cli                    3.0.0    2021-06-30 [1] CRAN (R 4.1.0)
 crayon                 1.4.1    2021-02-08 [1] CRAN (R 4.1.0)
 DBI                    1.1.1    2021-01-15 [1] CRAN (R 4.1.0)
 DelayedArray           0.18.0   2021-05-19 [1] Bioconductor  
 digest                 0.6.27   2020-10-24 [1] CRAN (R 4.1.0)
 dplyr                  1.0.7    2021-06-18 [1] CRAN (R 4.1.0)
 ellipsis               0.3.2    2021-04-29 [1] CRAN (R 4.1.0)
 evaluate               0.14     2019-05-28 [1] CRAN (R 4.1.0)
 fansi                  0.5.0    2021-05-25 [1] CRAN (R 4.1.0)
 fs                     1.5.0    2020-07-31 [1] CRAN (R 4.1.0)
 generics               0.1.0    2020-10-31 [1] CRAN (R 4.1.0)
 GenomeInfoDb         * 1.28.0   2021-05-19 [1] Bioconductor  
 GenomeInfoDbData       1.2.6    2021-06-28 [1] Bioconductor  
 GenomicAlignments      1.28.0   2021-05-19 [1] Bioconductor  
 GenomicRanges        * 1.44.0   2021-05-19 [1] Bioconductor  
 glue                   1.4.2    2020-08-27 [1] CRAN (R 4.1.0)
 htmltools              0.5.1.1  2021-01-22 [1] CRAN (R 4.1.0)
 httpuv                 1.6.1    2021-05-07 [1] CRAN (R 4.1.0)
 IRanges              * 2.26.0   2021-05-19 [1] Bioconductor  
 knitr                  1.33     2021-04-24 [1] CRAN (R 4.1.0)
 later                  1.2.0    2021-04-23 [1] CRAN (R 4.1.0)
 lattice                0.20-44  2021-05-02 [4] CRAN (R 4.1.0)
 lifecycle              1.0.0    2021-02-15 [1] CRAN (R 4.1.0)
 magrittr               2.0.1    2020-11-17 [1] CRAN (R 4.1.0)
 Matrix                 1.3-4    2021-06-01 [4] CRAN (R 4.1.0)
 MatrixGenerics         1.4.0    2021-05-19 [1] Bioconductor  
 matrixStats            0.59.0   2021-06-01 [1] CRAN (R 4.1.0)
 pillar                 1.6.1    2021-05-16 [1] CRAN (R 4.1.0)
 pkgconfig              2.0.3    2019-09-22 [1] CRAN (R 4.1.0)
 plyranges            * 1.12.1   2021-06-29 [1] Bioconductor  
 promises               1.2.0.1  2021-02-11 [1] CRAN (R 4.1.0)
 purrr                  0.3.4    2020-04-17 [1] CRAN (R 4.1.0)
 R6                     2.5.0    2020-10-28 [1] CRAN (R 4.1.0)
 Rcpp                   1.0.6    2021-01-15 [1] CRAN (R 4.1.0)
 RCurl                  1.98-1.3 2021-03-16 [1] CRAN (R 4.1.0)
 restfulr               0.0.13   2017-08-06 [1] CRAN (R 4.1.0)
 rjson                  0.2.20   2018-06-08 [1] CRAN (R 4.1.0)
 rlang                  0.4.11   2021-04-30 [1] CRAN (R 4.1.0)
 rmarkdown              2.9      2021-06-15 [1] CRAN (R 4.1.0)
 Rsamtools              2.8.0    2021-05-19 [1] Bioconductor  
 rstudioapi             0.13     2020-11-12 [1] CRAN (R 4.1.0)
 rtracklayer            1.52.0   2021-05-19 [1] Bioconductor  
 S4Vectors            * 0.30.0   2021-05-19 [1] Bioconductor  
 sessioninfo            1.1.1    2018-11-05 [1] CRAN (R 4.1.0)
 SummarizedExperiment   1.22.0   2021-05-19 [1] Bioconductor  
 tibble                 3.1.2    2021-05-16 [1] CRAN (R 4.1.0)
 tidyselect             1.1.1    2021-04-30 [1] CRAN (R 4.1.0)
 utf8                   1.2.1    2021-03-12 [1] CRAN (R 4.1.0)
 vctrs                  0.3.8    2021-04-29 [1] CRAN (R 4.1.0)
 withr                  2.4.2    2021-04-18 [1] CRAN (R 4.1.0)
 workflowr            * 1.6.2    2020-04-30 [1] CRAN (R 4.1.0)
 xfun                   0.24     2021-06-15 [1] CRAN (R 4.1.0)
 XML                    3.99-0.6 2021-03-16 [1] CRAN (R 4.1.0)
 XVector                0.32.0   2021-05-19 [1] Bioconductor  
 yaml                   2.2.1    2020-02-01 [1] CRAN (R 4.1.0)
 zlibbioc               1.38.0   2021-05-19 [1] Bioconductor  

smped commented 3 years ago

I should also add that I tracked the error down to the following line in .join_overlap_left():

mcols_outer <- na_dframe(mcols(right), sum(only_left))

That might save you a few minutes while debugging.

sa-lee commented 3 years ago

Thanks for the report Steve, I'll try to get to this one on the weekend :)

hw538 commented 2 years ago

The same bug occurs when the metadata column is a DNAStringSet. Would you mind adding support for this as well? Thank you!