rnabioco / valr

Genome Interval Arithmetic in R
http://rnabioco.github.io/valr/

bed_cluster creates different clusters when other intervals included #401

Closed kcamnairb closed 1 year ago

kcamnairb commented 1 year ago

Hi, I found some strange output from bed_cluster: when an interval that is further away is included, other intervals are no longer clustered together. In the example below, with max_dist set to 10, intervals 5-20 and 30-40 cluster together, but when interval 1-10 is included, intervals 5-20 and 30-40 no longer cluster together.

library(tidyverse)
library(valr)
tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 5,      20,
  "chr1", 30,     40
) %>%
bed_cluster(max_dist = 10)
## A tibble: 2 × 4
#  chrom start   end   .id
#  <chr> <dbl> <dbl> <int>
#1 chr1      5    20     1
#2 chr1     30    40     1
tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 5,      20,
  "chr1", 30,     40,
  "chr1", 1,      10
) %>%
bed_cluster(max_dist = 10)
## A tibble: 3 × 4
#  chrom start   end   .id
#  <chr> <dbl> <dbl> <int>
#1 chr1      1    10     1
#2 chr1      5    20     1
#3 chr1     30    40     2

jayhesselberth commented 1 year ago

Thanks, just adding the reprex

library(tidyverse)
library(valr)

tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 5,      20,
  "chr1", 30,     40
) %>%
  bed_cluster(max_dist = 10)
#> # A tibble: 2 × 4
#>   chrom start   end   .id
#>   <chr> <dbl> <dbl> <int>
#> 1 chr1      5    20     1
#> 2 chr1     30    40     1

tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 5,      20,
  "chr1", 30,     40,
  "chr1", 1,      10
) %>%
  bed_cluster(max_dist = 10)
#> # A tibble: 3 × 4
#>   chrom start   end   .id
#>   <chr> <dbl> <dbl> <int>
#> 1 chr1      1    10     1
#> 2 chr1      5    20     1
#> 3 chr1     30    40     2

Created on 2023-04-05 with reprex v2.0.2
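
To make the expected result explicit, the gaps in the three-interval example can be checked by hand. The following is a minimal base R sketch, not valr code: it sorts the intervals by start and measures each gap against the furthest end seen so far. It assumes that a gap equal to max_dist still merges, which is what the two-interval output above shows (30 - 20 = 10 with max_dist = 10).

```r
# Intervals from the report (base R only, no valr needed).
iv <- data.frame(start = c(5, 30, 1), end = c(20, 40, 10))
iv <- iv[order(iv$start, iv$end), ]

# Furthest end among all preceding intervals (NA for the first interval).
prev_max_end <- c(NA, head(cummax(iv$end), -1))

# Gap to everything before each interval; a negative value means overlap.
iv$gap <- iv$start - prev_max_end
iv
```

Under that assumption no gap exceeds 10, so all three intervals would be expected to share a single .id.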

kriemo commented 1 year ago

Thanks for reporting this bug. This should now be fixed in the main branch, which you can install via devtools.

# install.packages("devtools")
devtools::install_github('rnabioco/valr')

kcamnairb commented 1 year ago

It works great! Thank you.

kcamnairb commented 1 year ago

Sorry, I'm still having the same issue with different data. All the intervals below should cluster together.

library(tidyverse)
library(valr)
tibble::tribble(
  ~chrom,        ~start, ~end,
  "scaffold_66", 27262,  70396,
  "scaffold_66", 66594,  67647,
  "scaffold_66", 82218,  85280,
  "scaffold_66", 85878,  87553,
  "scaffold_66", 87831,  89885,
  "scaffold_66", 90498,  91996
) %>%
  bed_cluster(max_dist = 20000)
#> # A tibble: 6 × 4
#>   chrom       start   end   .id
#>   <chr>       <dbl> <dbl> <int>
#> 1 scaffold_66 27262 70396     1
#> 2 scaffold_66 66594 67647     1
#> 3 scaffold_66 82218 85280     1
#> 4 scaffold_66 85878 87553     1
#> 5 scaffold_66 87831 89885     1
#> 6 scaffold_66 90498 91996     2

kriemo commented 1 year ago

Thanks for reopening with the additional example. bed_cluster needs additional tests to avoid these bugs. Hopefully I can have a fix for you in the next few days.
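
One way such tests could be written is against a small, independent reference. The sketch below is a hypothetical helper (cluster_reference is not part of valr and is not how bed_cluster is implemented). It handles a single chromosome, sorts by start, and opens a new cluster only when the gap to the furthest preceding end exceeds max_dist. It assumes a gap equal to max_dist still merges, as in the two-interval example from this thread.

```r
# Hypothetical single-chromosome reference for clustering; base R only.
cluster_reference <- function(start, end, max_dist = 0) {
  ord <- order(start, end)
  s <- start[ord]
  e <- end[ord]
  # Furthest end among all preceding intervals; -Inf forces the first
  # interval to open cluster 1.
  prev_max_end <- c(-Inf, head(cummax(e), -1))
  # A new cluster starts only when the gap exceeds max_dist.
  new_cluster <- (s - prev_max_end) > max_dist
  data.frame(start = s, end = e, .id = cumsum(new_cluster))
}

# Both datasets reported in this issue.
res_small <- cluster_reference(c(5, 30, 1), c(20, 40, 10), max_dist = 10)
res_scaffold <- cluster_reference(
  c(27262, 66594, 82218, 85878, 87831, 90498),
  c(70396, 67647, 85280, 87553, 89885, 91996),
  max_dist = 20000
)
```

In a test suite, the .id column from bed_cluster could then be compared with the reference for each chromosome, for example on randomly generated intervals that include nested and out-of-order rows like the ones reported here.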

kriemo commented 1 year ago

This should be fixed now. Thanks again for reporting, and please reopen if you find this issue unresolved on additional datasets.