rnabioco / valr

Genome Interval Arithmetic in R
http://rnabioco.github.io/valr/
Other
88 stars 25 forks source link

reimplement closest #404

Closed kriemo closed 1 year ago

kriemo commented 1 year ago

When processing larger datasets bed_closest sometimes reports the incorrect interval due to complexities in choosing the correct search path in the interval tree. An approach could be written to parse the interval tree, but i believe it will require querying multiple search paths through the tree to ensure correctness. I've instead rewritten bed_closest to use a binary search approach, which has similar performance, and isn't as complicated to debug/maintain. This approach was inspired by the IRanges implementation written mostly in R.

library(valr)
genome <- read_genome(valr_example('hg19.chrom.sizes.gz'))
y <- bed_random(genome[1, ], length = 100, n = 1e6, seed = 10104)
y$idx <- seq_len(nrow(y))

x <- tibble::tribble(
  ~chrom,     ~start,       ~end, ~idx,
  "chr1", 135767933L, 135768033L,   3L
  )

bed_closest(x, y)
#> # A tibble: 1 × 9
#>   chrom   start.x     end.x idx.x   start.y     end.y  idx.y .overlap .dist
#>   <chr>     <int>     <int> <int>     <int>     <int>  <int>    <int> <int>
#> 1 chr1  135767933 135768033     3 135767712 135767812 545180        0  -122

# smaller interval tree (from y) reports correct interval
bed_closest(x, y[500000:600000, ])
#> # A tibble: 1 × 9
#>   chrom   start.x     end.x idx.x   start.y     end.y  idx.y .overlap .dist
#>   <chr>     <int>     <int> <int>     <int>     <int>  <int>    <int> <int>
#> 1 chr1  135767933 135768033     3 135768151 135768251 545181        0   119

Created on 2023-04-07 with reprex v2.0.2