mpierrejean / jointseg

6 stars 1 forks source link

True breakpoints not always recovered in noiseless profiles #6

Open pneuvial opened 7 years ago

pneuvial commented 7 years ago

When using the default parameters, 'jointSeg' fails to recover the true breakpoints exactly in noiseless Gaussian profiles, e.g.:

p <- 2
trueK <- 10
len <- 1e4

eq <- eqB <- eqL <- list()
for (ii in 1:100) {
    sim <- randomProfile(length=len, nBkp=trueK, noiseLevel=0, dim=p)
    Y <- sim$profile
    K <- 2*trueK

    res <- jointSeg(Y, method="RBS", K=K, modelSelectionMethod="Birge")
    bkp <- res$bestBkp
    eqB[[ii]] <- all.equal(bkp, sim$bkp)

    res <- jointSeg(Y, method="RBS", K=K, modelSelectionMethod="Lebarbier")
    bkp <- res$bestBkp
    eqL[[ii]] <- all.equal(bkp, sim$bkp)

    bkp <- res$dpBkp[[length(sim$bkp)]]
    eq[[ii]] <- all.equal(bkp, sim$bkp)
}

gives the following results:

  1. If the true number of breakpoints is known, then RBS+DP gets it correct:
> table(unlist(eq))
TRUE 
 100 
  1. If the true number of breakpoints is unknown, then (a) RBS+DP+model selection à la Birgé gets it correct as well:
> table(unlist(eqB))
TRUE 
 100 

(b) RBS+DP+model selection à la Lebarbier underestimates the true number of breakpoints in more than 1/3 of cases:

> table(unlist(eqL))
Numeric: lengths (7, 10) differ Numeric: lengths (8, 10) differ 
                              1                               4 
Numeric: lengths (9, 10) differ                            TRUE 
                             36                              59 

@mpierrejean , what is the reference for the formula we are using in 'chooseK' ? In the original paper https://hal.archives-ouvertes.fr/inria-00071847 and in the Picard et al 2005 paper, the formula is

pen(K) = 2K(c1 + c2 log(n/K))

(where K is the number of segments), while in 'chooseK' it seems that we are using

pen(K) = K*(c+log(n/K))

Is there a reason for this difference? And why are we using c=2.5 ?

pneuvial commented 7 years ago

Some more code to perform model selection directly after the above code:

Y <- sim$profile
K <- length(2*sim$bkp)
res <- jointSeg(Y, method="RBS", K=K)
bkp <- res$bestBkp

resDP <- pruneByDP(sim$profile, candCP=res$initBkp)
modelSelection(resDP$rse, n=nrow(sim$profile), method="Lebarbier")