rafalab / bumphunter

bumphunter
15 stars 14 forks source link

regionFinder not sorting positions #27

Open kperzel opened 4 years ago

kperzel commented 4 years ago

Hi, I have been trying to use regionFinder to find statistic bumps in large genomic data sets, but I have noticed that the output doesn't make sense. Here is an example:

library(bumphunter)
statistic = c(5.552229, 5.548916, 5.518587, 5.486775, 5.478647, 5.472431, 5.469969, 5.450930,5.437189, 5.385867, 5.377567, 5.369858, 5.331892, 5.305152, 5.277105, 5.269921,5.256093, 5.203086)
methPos = c(2225407, 2225187, 2236547, 2225535, 2279596, 2225023, 2225019, 2225580, 2224969,2279661, 2237227, 2224879, 2224834, 2203900, 2202092, 2203918, 2224756, 2224708)
methChr = rep("chr12",18)
data = as.data.frame(cbind(statistic,methPos,methChr))
data$statistic = as.numeric(as.character(data$statistic))
data$methPos = as.numeric(as.character(data$methPos))
regionFinder(data$statistic, data$methChr, data$methPos, cutoff = 3.5)

The first couple lines of the output look like this:

    chr   start     end    value      area cluster indexStart indexEnd  L
3 chr12 2224708 2225580 5.416306 59.579368       3          1       18 18
6 chr12 2279596 2279661 5.432257 10.864514       6          5       10  6
  clusterL
3       11
6        2

Those numbers don't make sense though, L should not be longer than clusterL, and if you look at what's actually contained in the first range, it's only 9 sites, not 18. I have noticed that the issue goes away if I manually order my data, but because the default for assumeSorted = FALSE it seems that I shouldn't have to. My data comes from multiple locations so it is a pain to have to pull it all together to order, rather than just being able to pull the relevant vectors. My version in use is bumphunter_1.28.0.