I'm trying to understand locfitByCluster a bit more. Currently, I'm using it for some clusters that can have large gaps and I'm seeing the following warning several times:
Warning in lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth, :
procv: no points with non-zero weight
I made the following small reproducible example. It generates the above warning 3 times. I looked at the first cluster where this happened and ran it manually.
library('bumphunter')
set.seed(20160428)
l <- 100
y <- sample(0:4, size = l, replace = TRUE)
x <- sort(sample(seq_len(3000 * l), size = l))
cluster <- clusterMaker(rep('chr1', l), x, maxGap = 3000)
length(unique(cluster))
test <- locfitByCluster(y = y, x = x, cluster = cluster, weights = NULL, minNum = 7, bpSpan = 1000, minInSpan = 0)
table(test$smoothed)
## Runs fine with a higher minInSpan value (also has warnings with minInSpan = 1)
test2 <- locfitByCluster(y = y, x = x, cluster = cluster, weights = NULL, minNum = 7, bpSpan = 1000, minInSpan = 2)
table(test2$smoothed)
## Identical in this case, but not in my larger use case
identical(test$fitted, test2$fitted)
## Run pieces of
## https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L53-L93
y <- matrix(y, ncol = 1)
bpSpan = 1000
minNum = 7
minInSpan = 0
weights = NULL
verbose = TRUE
weights <- matrix(1, nrow = nrow(y), ncol = ncol(y))
Indexes <- split(seq(along = cluster), cluster)
clusterL <- sapply(Indexes, length)
smoothed <- rep(TRUE, nrow(y))
Index <- Indexes[[2]]
nn <- minInSpan / length(Index)
j <- 1
sdata <- data.frame(pos = x[Index],
y = y[Index, j],
weights = weights[Index,j])
fit <- locfit(y ~ lp(pos, nn = nn, h = bpSpan), data = sdata,
weights = weights, family = "gaussian", maxk = 10000)
## Explore positions
x[cluster == 2]
## Reproducibility info
proc.time()
message(Sys.time())
options(width = 120)
devtools::session_info()
You'll note that if you change how y is created, to lets say y <- rnorm(l, mean = 100) the above example doesn't produce any more warnings.
The actual output of the example is below:
> library('bumphunter')
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:base’:
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
http://www.revolutionanalytics.com
Loading required package: iterators
Loading required package: locfit
locfit 1.5-9.1 2013-03-22
> set.seed(20160428)
> l <- 100
> y <- sample(0:4, size = l, replace = TRUE)
> x <- sort(sample(seq_len(3000 * l), size = l))
> cluster <- clusterMaker(rep('chr1', l), x, maxGap = 3000)
> length(unique(cluster))
[1] 35
> test <- locfitByCluster(y = y, x = x, cluster = cluster, weights = NULL, minNum = 7, bpSpan = 1000, minInSpan = 0)
Warning messages:
1: In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth, :
procv: no points with non-zero weight
2: In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth, :
procv: no points with non-zero weight
3: In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth, :
procv: no points with non-zero weight
> table(test$smoothed)
FALSE TRUE
78 22
>
> ## Runs fine with a higher minInSpan value (also has warnings with minInSpan = 1)
> test2 <- locfitByCluster(y = y, x = x, cluster = cluster, weights = NULL, minNum = 7, bpSpan = 1000, minInSpan = 2)
> table(test2$smoothed)
FALSE TRUE
78 22
>
> ## Identical in this case, but not in my larger use case
> identical(test$fitted, test2$fitted)
[1] TRUE
>
>
> ## Run pieces of
> ## https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L53-L93
>
> y <- matrix(y, ncol = 1)
> bpSpan = 1000
> minNum = 7
> minInSpan = 0
> weights = NULL
> verbose = TRUE
> weights <- matrix(1, nrow = nrow(y), ncol = ncol(y))
> Indexes <- split(seq(along = cluster), cluster)
> clusterL <- sapply(Indexes, length)
> smoothed <- rep(TRUE, nrow(y))
>
> Index <- Indexes[[2]]
> nn <- minInSpan / length(Index)
>
>
> j <- 1
> sdata <- data.frame(pos = x[Index],
+ y = y[Index, j],
+ weights = weights[Index,j])
> fit <- locfit(y ~ lp(pos, nn = nn, h = bpSpan), data = sdata,
+ weights = weights, family = "gaussian", maxk = 10000)
Warning message:
In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth, :
procv: no points with non-zero weight
>
> ## Explore positions
> x[cluster == 2]
[1] 12132 13144 15467 17207 17676 19193 20194
>
> ## Reproducibility info
> proc.time()
user system elapsed
7.615 0.347 7.785
> message(Sys.time())
2016-04-28 14:34:52
> options(width = 120)
> devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------
setting value
version R version 3.3.0 alpha (2016-03-23 r70368)
system x86_64, darwin13.4.0
ui AQUA
language (EN)
collate en_US.UTF-8
tz America/New_York
date 2016-04-28
Packages ---------------------------------------------------------------------------------------------------------------
package * version date source
AnnotationDbi 1.33.14 2016-04-27 Bioconductor
Biobase 2.31.3 2016-01-14 Bioconductor
BiocGenerics * 0.17.5 2016-04-11 Bioconductor
BiocParallel 1.5.22 2016-04-27 Bioconductor
biomaRt 2.27.2 2016-01-14 Bioconductor
Biostrings 2.39.14 2016-04-14 Bioconductor
bitops 1.0-6 2013-08-17 CRAN (R 3.3.0)
bumphunter * 1.11.5 2016-03-29 Bioconductor
codetools 0.2-14 2015-07-15 CRAN (R 3.3.0)
DBI 0.3.1 2014-09-24 CRAN (R 3.3.0)
devtools 1.11.1 2016-04-21 CRAN (R 3.3.0)
digest 0.6.9 2016-01-08 CRAN (R 3.3.0)
doRNG 1.6 2014-03-07 CRAN (R 3.3.0)
foreach * 1.4.3 2015-10-13 CRAN (R 3.3.0)
GenomeInfoDb * 1.7.6 2016-01-29 Bioconductor
GenomicAlignments 1.7.21 2016-04-12 Bioconductor
GenomicFeatures 1.23.30 2016-04-12 Bioconductor
GenomicRanges * 1.23.27 2016-04-12 Bioconductor
IRanges * 2.5.46 2016-04-17 Bioconductor
iterators * 1.0.8 2015-10-13 CRAN (R 3.3.0)
lattice 0.20-33 2015-07-14 CRAN (R 3.3.0)
locfit * 1.5-9.1 2013-04-20 CRAN (R 3.3.0)
magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
matrixStats 0.50.2 2016-04-24 CRAN (R 3.3.0)
memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
pkgmaker 0.22 2014-05-14 CRAN (R 3.3.0)
RCurl 1.95-4.8 2016-03-01 CRAN (R 3.3.0)
registry 0.3 2015-07-08 CRAN (R 3.3.0)
rngtools 1.2.4 2014-03-06 CRAN (R 3.3.0)
Rsamtools 1.23.9 2016-04-27 Bioconductor
RSQLite 1.0.0 2014-10-25 CRAN (R 3.3.0)
rtracklayer 1.31.10 2016-04-07 Bioconductor
S4Vectors * 0.9.52 2016-04-27 Bioconductor
stringi 1.0-1 2015-10-22 CRAN (R 3.3.0)
stringr 1.0.0 2015-04-30 CRAN (R 3.3.0)
SummarizedExperiment 1.1.27 2016-04-27 Bioconductor
withr 1.0.1 2016-02-04 CRAN (R 3.3.0)
XML 3.98-1.4 2016-03-01 CRAN (R 3.3.0)
xtable 1.8-2 2016-02-05 CRAN (R 3.3.0)
XVector 0.11.8 2016-04-06 Bioconductor
zlibbioc 1.17.1 2016-03-19 Bioconductor
>
Looking at the help of locfitByCluster I see the following:
minNum
Clusters with fewer than minNum locations will not be smoothed
minInSpan
Only smooth the region if there are at least this many locations in the span.
However, I only see minInSpan play a role at https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L76 which doesn't seem to match the description to me. Specially since length(Index) is the length of the cluster, not of the sub-regions in the cluster. Or maybe you meant minInSpan: Only smooth the cluster if there at least this many locations in the span.
case LF_NOPT:
WARN(("procv: no points with non-zero weight"));
set_default_like(&lf->fp,v);
return(k);
In this particular small example, the returned values are the same. But in my larger use case they are not. I'm looking at base-resolution F-statistics (that's my y) while using minNum = 100 since the read length is 100 for this data set.
Parts of my code is:
> cluster <- derfinder:::.clusterMakerRle(prep$position, 3000)
> test <- locfitByCluster(y = fstats, x = which(prep$position), cluster = cluster, weights = NULL, minNum = 100, bpSpan = 300, minInSpan = 0) ## leads to many warnings
> test2 <- locfitByCluster(y = fstats, x = which(prep$position), cluster = cluster, weights = NULL, minNum = 100, bpSpan = 300, minInSpan = 5) ## no warnings, takes longer to run
> identical(test$smoothed, test2$smoothed)
[1] TRUE
> identical(test$fitted, test2$fitted)
[1] FALSE
> summary(test$fitted - test2$fitted)
V1
Min. :-50.595
1st Qu.: 0.000
Median : 0.000
Mean : 0.000
3rd Qu.: 0.000
Max. : 7.985
NA's :12132
> test3 <- locfitByCluster(y = fstats, x = which(prep$position), cluster = cluster, weights = NULL, minNum = 100, bpSpan = 300, minInSpan = 100) ## Faster than with minInSpan = 5
> identical(test3$smoothed, test2$smoothed)
[1] TRUE
> identical(test3$fitted, test2$fitted)
[1] FALSE
> summary(test3$fitted - test2$fitted)
V1
Min. :-41.88
1st Qu.: 0.00
Median : 0.00
Mean : 0.00
3rd Qu.: 0.00
Max. : 22.14
NA's :12132
> summary(test$fitted - test3$fitted)
V1
Min. :-50.41
1st Qu.: 0.00
Median : 0.00
Mean : 0.00
3rd Qu.: 0.00
Max. : 41.88
NA's :12132
The full code is here although the data is not there.
I think that I should use minInSpan equal to minNum (100 in my use case). However, the description of minInSpan doesn't match what it does. I know minInSpan / length(Index) is passed to locfit::lp(nn) whose description is:
nn
Nearest neighbor component of the smoothing parameter. Default value is 0.7, unless either h or adpen are provided, in which case the default is 0.
This leaves me confused. Maybe there is something I'm missing on how minInSpan actually enforces the minimum number of locations needed before smoothing. I'm guessing nn should be between 0 and 1, but looking at my data, in some cases length(Index) is smaller than 100, which would lead to nn values greater than 1. Edit: actually https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L75 would pick these up since clusterL[i] is less than minNum = 100.
I'll appreciate it if you can help me figure out what is the proper minInSpan value to use. I know that I can run the code right now without changing minInSpan, but it doesn't feel right to have hundreds of warnings.
Hi,
I'm trying to understand
locfitByCluster
a bit more. Currently, I'm using it for some clusters that can have large gaps and I'm seeing the following warning several times:I made the following small reproducible example. It generates the above warning 3 times. I looked at the first cluster where this happened and ran it manually.
You'll note that if you change how
y
is created, to lets sayy <- rnorm(l, mean = 100)
the above example doesn't produce any more warnings.The actual output of the example is below:
Looking at the help of
locfitByCluster
I see the following:However, I only see
minInSpan
play a role at https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L76 which doesn't seem to match the description to me. Specially sincelength(Index)
is the length of the cluster, not of the sub-regions in the cluster. Or maybe you meantminInSpan: Only smooth the cluster if there at least this many locations in the span
.The warning is eventually produced by
locfit::locfit
and in particular by lines 88-91 ofprocv.c
(from the package source at https://cran.rstudio.com/web/packages/locfit/index.html) which reads:In this particular small example, the returned values are the same. But in my larger use case they are not. I'm looking at base-resolution F-statistics (that's my
y
) while usingminNum = 100
since the read length is 100 for this data set.Parts of my code is:
The full code is here although the data is not there.
I think that I should use
minInSpan
equal tominNum
(100 in my use case). However, the description ofminInSpan
doesn't match what it does. I knowminInSpan / length(Index)
is passed tolocfit::lp(nn)
whose description is:This leaves me confused. Maybe there is something I'm missing on how
minInSpan
actually enforces the minimum number of locations needed before smoothing. I'm guessingnn
should be between 0 and 1, but looking at my data, in some caseslength(Index)
is smaller than 100, which would lead tonn
values greater than 1. Edit: actually https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L75 would pick these up sinceclusterL[i]
is less thanminNum = 100
.I'll appreciate it if you can help me figure out what is the proper
minInSpan
value to use. I know that I can run the code right now without changingminInSpan
, but it doesn't feel right to have hundreds of warnings.Thanks! Leo