rnabioco / valr

Genome Interval Arithmetic in R
http://rnabioco.github.io/valr/

bed_closest giving too many results #376

Closed kbauerm closed 3 years ago

kbauerm commented 3 years ago

When running bed_closest, I get more rows in the output than there were in the input: there should be 3414 observations, but I'm getting 3511.

nearby <- bed_closest(bed1, bed2)

kriemo commented 3 years ago

Could you make a reproducible example from your input files? One option would be to attach the bed1 and bed2 R objects to this issue as rds files; another would be to make a minimal reprex (https://github.com/jennybc/reprex#what-is-a-reprex). Thanks.
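A minimal sketch of the rds route, using base R only. The `bed1` data frame below is a made-up stand-in for the real input, and the temporary file path is just for illustration; in practice you would save your actual `bed1` and `bed2` objects to files you can attach.

```r
# Stand-in for the real input; replace with your actual bed1 object.
bed1 <- data.frame(
  chrom = c("chr1", "chr2"),
  start = c(500, 5000),
  end   = c(600, 6000)
)

# Write the object to an .rds file (a temp path is used here for illustration).
path <- file.path(tempdir(), "bed1.rds")
saveRDS(bed1, path)

# Anyone who downloads the file can restore the same object with readRDS().
restored <- readRDS(path)
same <- identical(bed1, restored)
```

The same two calls would apply to `bed2`. Reading the file back before attaching it is a cheap check that the object round-trips intact.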

kriemo commented 3 years ago

bed_closest reports ties among the closest intervals and also reports overlapping intervals, so a single input interval can appear in more than one output row. This may explain the discrepancy.

For example:

library(valr)
library(tibble)
x <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 500,    600,
  "chr2", 5000,   6000
)

y <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 100,    200,
  "chr1", 150,    200,
  "chr1", 550,    580,
  "chr2", 7000,   8500
)

bed_closest(x, y)
#> # A tibble: 4 x 7
#>   chrom start.x end.x start.y end.y .overlap .dist
#>   <chr>   <dbl> <dbl>   <dbl> <dbl>    <int> <int>
#> 1 chr1      500   600     550   580       30     0
#> 2 chr1      500   600     100   200        0  -301
#> 3 chr1      500   600     150   200        0  -301
#> 4 chr2     5000  6000    7000  8500        0  1001

bed_closest(x, y, overlap = FALSE)
#> # A tibble: 3 x 6
#>   chrom start.x end.x start.y end.y .dist
#>   <chr>   <dbl> <dbl>   <dbl> <dbl> <int>
#> 1 chr1      500   600     100   200  -301
#> 2 chr1      500   600     150   200  -301
#> 3 chr2     5000  6000    7000  8500  1001

Created on 2021-05-26 by the reprex package (v0.3.0)
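If one output row per input interval is needed, the ties can be collapsed after the fact. The following is a sketch rather than a built-in valr feature: it assumes dplyr (>= 1.0.0) is installed, reuses the example data above, and relies on the column names shown in the output (`start.x`, `end.x`, `.dist`). The names `matches_per_x` and `one_per_x` are made up for illustration, and keeping the first row among ties is an arbitrary choice.

```r
library(valr)
library(dplyr)

x <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 500,    600,
  "chr2", 5000,   6000
)

y <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 100,    200,
  "chr1", 150,    200,
  "chr1", 550,    580,
  "chr2", 7000,   8500
)

res <- bed_closest(x, y)

# Count how many y intervals were reported for each x interval.
# Any x interval with n_matches > 1 contributes extra rows to the output.
matches_per_x <- count(res, chrom, start.x, end.x, name = "n_matches")

# Keep one row per x interval: the smallest absolute distance,
# with ties broken by taking the first row (arbitrary choice).
one_per_x <- res %>%
  group_by(chrom, start.x, end.x) %>%
  slice_min(abs(.dist), n = 1, with_ties = FALSE) %>%
  ungroup()
```

Inspecting `matches_per_x` first shows which input intervals account for the extra rows, which helps when deciding whether dropping ties is appropriate for the analysis.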

kriemo commented 3 years ago

Closing this issue for now. Feel free to reopen it if you have a reproducible example.