rnabioco / valr

Genome Interval Arithmetic in R
http://rnabioco.github.io/valr/

[bed_intersect] consider warning user when input data is passed multiple times through %>% operator #412

Closed kriemo closed 4 months ago

kriemo commented 5 months ago

It's possible for a user to mistakenly pass a dataset twice to bed_intersect() when using the %>% operator. This happens when the placeholder (.) appears only inside a nested function call within the bed_intersect() call: the data is passed implicitly as the x argument, and then a second time through the dots (...) argument. I'm not sure what options we have to detect this on our end, and for some users this might be a feature rather than a bug. The native pipe operator (|>) won't allow this, as the placeholder can't be used to pass the data a second time.
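The mechanism can be reproduced without valr. The sketch below uses a hypothetical function, count_inputs(), which is not part of valr or magrittr and only reports how many inputs it received; it stands in for any function that takes an x argument followed by dots. It assumes the magrittr package is installed.

```r
library(magrittr)

# Hypothetical stand-in: counts `x` plus everything supplied through `...`
count_inputs <- function(x, ...) {
  1L + length(list(...))
}

df <- data.frame(a = 1)

# `.` appears only inside a nested call, so magrittr still inserts `df`
# as the first argument; the call becomes count_inputs(df, identity(df))
nested_only <- df %>% count_inputs(identity(.))

# `.` used as a direct argument suppresses the implicit first argument
direct <- df %>% count_inputs(.)

# Wrapping the right-hand side in braces also suppresses it
braced <- df %>% {
  count_inputs(identity(.))
}

print(c(nested_only = nested_only, direct = direct, braced = braced))
#> nested_only      direct      braced 
#>           2           1           1
```

If the double pass is unintended, the braced form is one way to keep %>% while passing the data only once.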

library(valr)
library(dplyr)
library(tibble)

x <- tribble(
  ~chrom, ~start, ~end, ~strand,
  "1", 0L, 5L, "+"
)

# data from x interpreted as two inputs
x %>%
  bed_intersect(group_by(., strand))
#> # A tibble: 1 × 9
#>   chrom start.x end.x strand.x start.y end.y strand.y .source .overlap
#>   <chr>   <int> <int> <chr>      <int> <int> <chr>    <chr>      <int>
#> 1 1           0     5 +              0     5 +        1              5

# is equivalent to
bed_intersect(x, group_by(x, strand))
#> # A tibble: 1 × 9
#>   chrom start.x end.x strand.x start.y end.y strand.y .source .overlap
#>   <chr>   <int> <int> <chr>      <int> <int> <chr>    <chr>      <int>
#> 1 1           0     5 +              0     5 +        1              5

# this behavior can be confusing if intersecting with another tibble
y <- tribble(
  ~chrom, ~start, ~end, ~nonsense, ~strand,
  "XX", 100L, 500L,  "hello!", "-"
)

x %>%
  bed_intersect(group_by(., strand), group_by(y, strand))
#> # A tibble: 1 × 10
#>   chrom start.x end.x strand.x start.y end.y strand.y nonsense.y .source
#>   <chr>   <int> <int> <chr>      <int> <int> <chr>    <chr>      <chr>  
#> 1 1           0     5 +              0     5 +        <NA>       1      
#> # ℹ 1 more variable: .overlap <int>

# is equivalent to:
bed_intersect(x, group_by(x, strand), group_by(y, strand))
#> # A tibble: 1 × 10
#>   chrom start.x end.x strand.x start.y end.y strand.y nonsense.y .source
#>   <chr>   <int> <int> <chr>      <int> <int> <chr>    <chr>      <chr>  
#> 1 1           0     5 +              0     5 +        <NA>       1      
#> # ℹ 1 more variable: .overlap <int>

This doesn't happen with the native pipe, as you can't (implicitly or explicitly) pass the data twice.

x |> bed_intersect(group_by(.data = _, strand))
#> Error in bed_intersect(x, group_by(.data = "_", strand)) : 
#>   invalid use of pipe placeholder (<input>:1:0)

Created on 2024-04-03 with reprex v2.1.0
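With the native pipe, a second use of the data has to be written out explicitly, which makes the intent visible at the call site. A minimal sketch, again using the hypothetical count_inputs() stand-in rather than bed_intersect(), and assuming R 4.1 or later for |> and the \(d) lambda syntax:

```r
# Hypothetical stand-in: counts `x` plus everything supplied through `...`
count_inputs <- function(x, ...) {
  1L + length(list(...))
}

df <- data.frame(a = 1)

# Option 1: name the data explicitly wherever it is needed a second time
explicit <- df |> count_inputs(identity(df))

# Option 2: bind the piped value to a name with an anonymous function,
# then use that name as often as needed
lambda <- df |> (\(d) count_inputs(d, identity(d)))()

print(c(explicit = explicit, lambda = lambda))
#> explicit   lambda 
#>        2        2
```

Both forms pass the data twice on purpose, so there is no hidden first argument to overlook.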

kriemo commented 4 months ago

Closing, as I think the current behavior is desirable.
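For anyone who does want a warning in their own code, one possible approach is to compare x against each input supplied through the dots. The sketch below is an assumption about how such a check could look, not something valr implements; warn_if_duplicated_input() and columns_of() are hypothetical names. It compares column names and values only, so attributes such as dplyr grouping are ignored, and it will also flag two separately created but identical tables.

```r
# Hypothetical helper: extract columns as a plain named list, dropping
# data-frame-level attributes (class, row names, grouping metadata)
columns_of <- function(d) {
  stats::setNames(lapply(names(d), function(nm) d[[nm]]), names(d))
}

# Hypothetical check, not part of valr: warn when any input in `...`
# has the same columns and values as `x`
warn_if_duplicated_input <- function(x, ...) {
  others <- list(...)
  dup <- vapply(
    others,
    function(y) identical(columns_of(x), columns_of(y)),
    logical(1)
  )
  if (any(dup)) {
    warning(
      "`x` also appears in `...` at position(s): ",
      paste(which(dup), collapse = ", "),
      call. = FALSE
    )
  }
  invisible(dup)
}

x <- data.frame(chrom = "1", start = 0L, end = 5L, strand = "+")
y <- data.frame(chrom = "XX", start = 100L, end = 500L, strand = "-")

# Capture the warning text instead of letting it print
msg <- tryCatch(
  warn_if_duplicated_input(x, x, y),
  warning = function(w) conditionMessage(w)
)
print(msg)
#> [1] "`x` also appears in `...` at position(s): 1"

# No warning when the inputs differ
flags <- warn_if_duplicated_input(x, y)
```

Because the comparison is by value, a check like this cannot tell an accidental double pass from a deliberate self-intersection, which matches the concern that for some users this is a feature.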