rafalab / bumphunter

bumphunter
15 stars 14 forks source link

How to choose the `minInSpan` argument of `locfitByCluster`? #10

Open lcolladotor opened 8 years ago

lcolladotor commented 8 years ago

Hi,

I'm trying to understand locfitByCluster a bit more. Currently, I'm using it for some clusters that can have large gaps and I'm seeing the following warning several times:

Warning in lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth,  :
  procv: no points with non-zero weight

I made the following small reproducible example. It generates the above warning 3 times. I looked at the first cluster where this happened and ran it manually.

library('bumphunter')
set.seed(20160428)
l <- 100
y <- sample(0:4, size = l, replace = TRUE)
x <- sort(sample(seq_len(3000 * l), size = l))
cluster <- clusterMaker(rep('chr1', l), x, maxGap = 3000)
length(unique(cluster))
test <- locfitByCluster(y = y, x = x, cluster = cluster, weights = NULL, minNum = 7, bpSpan = 1000, minInSpan = 0)
table(test$smoothed)

## Runs fine with a higher minInSpan value (also has warnings with minInSpan = 1)
test2 <- locfitByCluster(y = y, x = x, cluster = cluster, weights = NULL, minNum = 7, bpSpan = 1000, minInSpan = 2)
table(test2$smoothed)

## Identical in this case, but not in my larger use case
identical(test$fitted, test2$fitted)

## Run pieces of
## https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L53-L93

y <- matrix(y, ncol = 1) 
bpSpan = 1000
minNum = 7
minInSpan = 0
weights = NULL
verbose = TRUE
weights <- matrix(1, nrow = nrow(y), ncol = ncol(y))
Indexes <- split(seq(along = cluster), cluster)
clusterL <- sapply(Indexes, length)
smoothed <- rep(TRUE, nrow(y))

Index <- Indexes[[2]]
nn <- minInSpan / length(Index)    

j <- 1
sdata <- data.frame(pos = x[Index],
                    y = y[Index, j],
                    weights = weights[Index,j])
fit <- locfit(y ~ lp(pos, nn = nn, h = bpSpan), data = sdata,
              weights = weights, family = "gaussian", maxk = 10000)       

## Explore positions
x[cluster == 2]

## Reproducibility info
proc.time()
message(Sys.time())
options(width = 120)
devtools::session_info()

You'll note that if you change how y is created, to lets say y <- rnorm(l, mean = 100) the above example doesn't produce any more warnings.

The actual output of the example is below:

> library('bumphunter')
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
http://www.revolutionanalytics.com
Loading required package: iterators
Loading required package: locfit
locfit 1.5-9.1   2013-03-22
> set.seed(20160428)
> l <- 100
> y <- sample(0:4, size = l, replace = TRUE)
> x <- sort(sample(seq_len(3000 * l), size = l))
> cluster <- clusterMaker(rep('chr1', l), x, maxGap = 3000)
> length(unique(cluster))
[1] 35
> test <- locfitByCluster(y = y, x = x, cluster = cluster, weights = NULL, minNum = 7, bpSpan = 1000, minInSpan = 0)
Warning messages:
1: In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth,  :
  procv: no points with non-zero weight
2: In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth,  :
  procv: no points with non-zero weight
3: In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth,  :
  procv: no points with non-zero weight
> table(test$smoothed)

FALSE  TRUE 
   78    22 
> 
> ## Runs fine with a higher minInSpan value (also has warnings with minInSpan = 1)
> test2 <- locfitByCluster(y = y, x = x, cluster = cluster, weights = NULL, minNum = 7, bpSpan = 1000, minInSpan = 2)
> table(test2$smoothed)

FALSE  TRUE 
   78    22 
> 
> ## Identical in this case, but not in my larger use case
> identical(test$fitted, test2$fitted)
[1] TRUE
> 
> 
> ## Run pieces of
> ## https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L53-L93
> 
> y <- matrix(y, ncol = 1) 
> bpSpan = 1000
> minNum = 7
> minInSpan = 0
> weights = NULL
> verbose = TRUE
> weights <- matrix(1, nrow = nrow(y), ncol = ncol(y))
> Indexes <- split(seq(along = cluster), cluster)
> clusterL <- sapply(Indexes, length)
> smoothed <- rep(TRUE, nrow(y))
> 
> Index <- Indexes[[2]]
> nn <- minInSpan / length(Index)    
> 
> 
> j <- 1
> sdata <- data.frame(pos = x[Index],
+                     y = y[Index, j],
+                     weights = weights[Index,j])
> fit <- locfit(y ~ lp(pos, nn = nn, h = bpSpan), data = sdata,
+               weights = weights, family = "gaussian", maxk = 10000)       
Warning message:
In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth,  :
  procv: no points with non-zero weight
> 
> ## Explore positions
> x[cluster == 2]
[1] 12132 13144 15467 17207 17676 19193 20194
> 
> ## Reproducibility info
> proc.time()
   user  system elapsed 
  7.615   0.347   7.785 
> message(Sys.time())
2016-04-28 14:34:52
> options(width = 120)
> devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------
 setting  value                                    
 version  R version 3.3.0 alpha (2016-03-23 r70368)
 system   x86_64, darwin13.4.0                     
 ui       AQUA                                     
 language (EN)                                     
 collate  en_US.UTF-8                              
 tz       America/New_York                         
 date     2016-04-28                               

Packages ---------------------------------------------------------------------------------------------------------------
 package              * version  date       source        
 AnnotationDbi          1.33.14  2016-04-27 Bioconductor  
 Biobase                2.31.3   2016-01-14 Bioconductor  
 BiocGenerics         * 0.17.5   2016-04-11 Bioconductor  
 BiocParallel           1.5.22   2016-04-27 Bioconductor  
 biomaRt                2.27.2   2016-01-14 Bioconductor  
 Biostrings             2.39.14  2016-04-14 Bioconductor  
 bitops                 1.0-6    2013-08-17 CRAN (R 3.3.0)
 bumphunter           * 1.11.5   2016-03-29 Bioconductor  
 codetools              0.2-14   2015-07-15 CRAN (R 3.3.0)
 DBI                    0.3.1    2014-09-24 CRAN (R 3.3.0)
 devtools               1.11.1   2016-04-21 CRAN (R 3.3.0)
 digest                 0.6.9    2016-01-08 CRAN (R 3.3.0)
 doRNG                  1.6      2014-03-07 CRAN (R 3.3.0)
 foreach              * 1.4.3    2015-10-13 CRAN (R 3.3.0)
 GenomeInfoDb         * 1.7.6    2016-01-29 Bioconductor  
 GenomicAlignments      1.7.21   2016-04-12 Bioconductor  
 GenomicFeatures        1.23.30  2016-04-12 Bioconductor  
 GenomicRanges        * 1.23.27  2016-04-12 Bioconductor  
 IRanges              * 2.5.46   2016-04-17 Bioconductor  
 iterators            * 1.0.8    2015-10-13 CRAN (R 3.3.0)
 lattice                0.20-33  2015-07-14 CRAN (R 3.3.0)
 locfit               * 1.5-9.1  2013-04-20 CRAN (R 3.3.0)
 magrittr               1.5      2014-11-22 CRAN (R 3.3.0)
 matrixStats            0.50.2   2016-04-24 CRAN (R 3.3.0)
 memoise                1.0.0    2016-01-29 CRAN (R 3.3.0)
 pkgmaker               0.22     2014-05-14 CRAN (R 3.3.0)
 RCurl                  1.95-4.8 2016-03-01 CRAN (R 3.3.0)
 registry               0.3      2015-07-08 CRAN (R 3.3.0)
 rngtools               1.2.4    2014-03-06 CRAN (R 3.3.0)
 Rsamtools              1.23.9   2016-04-27 Bioconductor  
 RSQLite                1.0.0    2014-10-25 CRAN (R 3.3.0)
 rtracklayer            1.31.10  2016-04-07 Bioconductor  
 S4Vectors            * 0.9.52   2016-04-27 Bioconductor  
 stringi                1.0-1    2015-10-22 CRAN (R 3.3.0)
 stringr                1.0.0    2015-04-30 CRAN (R 3.3.0)
 SummarizedExperiment   1.1.27   2016-04-27 Bioconductor  
 withr                  1.0.1    2016-02-04 CRAN (R 3.3.0)
 XML                    3.98-1.4 2016-03-01 CRAN (R 3.3.0)
 xtable                 1.8-2    2016-02-05 CRAN (R 3.3.0)
 XVector                0.11.8   2016-04-06 Bioconductor  
 zlibbioc               1.17.1   2016-03-19 Bioconductor  
> 

Looking at the help of locfitByCluster I see the following:

minNum    
Clusters with fewer than minNum locations will not be smoothed

minInSpan    
Only smooth the region if there are at least this many locations in the span.

However, I only see minInSpan play a role at https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L76 which doesn't seem to match the description to me. Specially since length(Index) is the length of the cluster, not of the sub-regions in the cluster. Or maybe you meant minInSpan: Only smooth the cluster if there at least this many locations in the span.

The warning is eventually produced by locfit::locfit and in particular by lines 88-91 of procv.c (from the package source at https://cran.rstudio.com/web/packages/locfit/index.html) which reads:

    case LF_NOPT:
      WARN(("procv: no points with non-zero weight"));
      set_default_like(&lf->fp,v);
      return(k);

In this particular small example, the returned values are the same. But in my larger use case they are not. I'm looking at base-resolution F-statistics (that's my y) while using minNum = 100 since the read length is 100 for this data set.

Parts of my code is:

> cluster <- derfinder:::.clusterMakerRle(prep$position, 3000)
> test <- locfitByCluster(y = fstats, x = which(prep$position), cluster = cluster, weights = NULL, minNum = 100, bpSpan = 300, minInSpan = 0) ## leads to many warnings
> test2 <- locfitByCluster(y = fstats, x = which(prep$position), cluster = cluster, weights = NULL, minNum = 100, bpSpan = 300, minInSpan = 5) ## no warnings, takes longer to run
> identical(test$smoothed, test2$smoothed)
[1] TRUE
> identical(test$fitted, test2$fitted)
[1] FALSE
> summary(test$fitted - test2$fitted)
       V1
 Min.   :-50.595
 1st Qu.:  0.000
 Median :  0.000
 Mean   :  0.000
 3rd Qu.:  0.000
 Max.   :  7.985
 NA's   :12132
> test3 <- locfitByCluster(y = fstats, x = which(prep$position), cluster = cluster, weights = NULL, minNum = 100, bpSpan = 300, minInSpan = 100) ## Faster than with minInSpan = 5
> identical(test3$smoothed, test2$smoothed)
[1] TRUE
> identical(test3$fitted, test2$fitted)
[1] FALSE
> summary(test3$fitted - test2$fitted)
       V1
 Min.   :-41.88
 1st Qu.:  0.00
 Median :  0.00
 Mean   :  0.00
 3rd Qu.:  0.00
 Max.   : 22.14
 NA's   :12132
> summary(test$fitted - test3$fitted)
       V1
 Min.   :-50.41
 1st Qu.:  0.00
 Median :  0.00
 Mean   :  0.00
 3rd Qu.:  0.00
 Max.   : 41.88
 NA's   :12132

The full code is here although the data is not there.

I think that I should use minInSpan equal to minNum (100 in my use case). However, the description of minInSpan doesn't match what it does. I know minInSpan / length(Index) is passed to locfit::lp(nn) whose description is:

nn  
Nearest neighbor component of the smoothing parameter. Default value is 0.7, unless either h or adpen are provided, in which case the default is 0.

This leaves me confused. Maybe there is something I'm missing on how minInSpan actually enforces the minimum number of locations needed before smoothing. I'm guessing nn should be between 0 and 1, but looking at my data, in some cases length(Index) is smaller than 100, which would lead to nn values greater than 1. Edit: actually https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L75 would pick these up since clusterL[i] is less than minNum = 100.

I'll appreciate it if you can help me figure out what is the proper minInSpan value to use. I know that I can run the code right now without changing minInSpan, but it doesn't feel right to have hundreds of warnings.

Thanks! Leo