rkillick / changepoint

A place for the development version of the changepoint package on CRAN.

cpt.meanvar return cost value? #49

Open tdhock opened 3 years ago

tdhock commented 3 years ago

Hey again @rkillick, I'm using cpt.meanvar in class next week and I noticed that it can sometimes return a segment variance of zero:

> changepoint::cpt.meanvar(c(0,0,4,5), penalty="Manual", method="PELT", pen.value=0)@param.est
$mean
[1] 0.0 4.5

$variance
[1] 0.00 0.25

I assume you are minimizing the negative log likelihood, is that correct? In that case the cost of this model should be -Inf, right? Would it be possible to return the cost value, please? (It would be helpful.)
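A minimal sketch of why the cost would be -Inf, assuming a Gaussian negative log likelihood evaluated at the maximum-likelihood mean and variance of each segment. The helper `seg_nll` is hypothetical and is not part of changepoint; the package's internal cost may differ by constants or scaling.

```r
# Hypothetical helper: Gaussian negative log likelihood of one segment,
# evaluated at the MLE mean and variance (variance divides by n, not n - 1).
seg_nll <- function(x) {
  n <- length(x)
  s2 <- mean((x - mean(x))^2)
  0.5 * n * (log(2 * pi) + log(s2) + 1)
}

y <- c(0, 0, 4, 5)
cost_first  <- seg_nll(y[1:2])  # variance 0, so log(0) makes this -Inf
cost_second <- seg_nll(y[3:4])  # variance 0.25, finite cost
total_cost  <- cost_first + cost_second
is.finite(total_cost)           # FALSE: the whole model has cost -Inf
```

With the MLE variance, the second segment gives 0.25, which matches the `$variance` output above.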

In this case the variance is estimated as zero because there are two consecutive data points which have the same value. I notice that you enforce minseglen=2. Is this an effort to avoid segments of zero variance, i.e. to only allow models which are "well-defined" in the sense that they have a finite log likelihood value? If so, you may consider an adaptive approach: either apply a run-length encoding (with weights) before running the algo, or disallow segments of zero variance during the algo.
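The run-length encoding idea can be sketched with base R's `rle`. This only shows the preprocessing step; whether changepoint could accept the resulting weights is not established in this thread, and the `compressed` data frame layout is an assumption made for illustration.

```r
y <- c(0, 0, 4, 5)

# Collapse consecutive identical values into (value, weight) pairs.
runs <- rle(y)
compressed <- data.frame(value = runs$values, weight = runs$lengths)

# Runs with weight > 1 are the ones that would form a zero-variance
# segment if the algorithm placed them in a segment of their own.
constant_runs <- compressed[compressed$weight > 1, ]

# The original series can always be recovered from the encoding.
restored <- inverse.rle(runs)
```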

FYI I used PELT above but the problem seems to affect SegNeigh as well.

rkillick commented 3 years ago

To avoid calculating variances on segments which do not vary, there is a catch in the code: if a negative variance is estimated, then the returned variance is replaced with a value very close to 0 (just within machine precision). This is to avoid a division by 0 in the likelihood, as you point out.
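A rough sketch of what such a catch could look like, under stated assumptions: the floor value `.Machine$double.eps` and the `s2 <= 0` trigger are illustrative choices, not the values taken from the package source, and `seg_nll_floored` is a hypothetical helper.

```r
# Hypothetical floor; the value the package actually uses is not stated here.
var_floor <- .Machine$double.eps

# Gaussian negative log likelihood of one segment, with the variance
# estimate replaced by a small positive floor when it is not positive.
seg_nll_floored <- function(x, floor = var_floor) {
  n <- length(x)
  s2 <- mean((x - mean(x))^2)
  if (s2 <= 0) s2 <- floor  # assumed trigger: any non-positive estimate
  0.5 * n * (log(2 * pi) + log(s2) + 1)
}

floored_cost <- seg_nll_floored(c(0, 0))
is.finite(floored_cost)  # TRUE: finite, but strongly negative
```

A floored cost stays finite but still rewards constant segments heavily, so the search can still prefer them.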

The minseglen=2 minimum is because we are estimating two parameters per segment, the mean and the variance. Thus we need at least two data points to do this.

The cost function is the same for all search methods so this would affect them all.

rkillick commented 2 years ago

Thinking about this more, there are legitimate reasons to have a segment with a variance of 0, where all values are equal. The run could be short, in which case you likely don't want it identified as a changepoint, but it could also be longer (I'm looking at some data which is rounded, and so we do get runs of the same value). I personally think that this should be handled by the user and not the algorithm. Therefore I think a warning from the changepoint methods when there are sequential observations with the same value should suffice.
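One way such a warning could be sketched in user code, assuming the check runs on the raw data before the changepoint call. The function name `warn_on_constant_runs`, its `minseglen` argument, and the message wording are all hypothetical and not part of the package.

```r
# Hypothetical check: warn when the data contain a run of identical
# consecutive values at least `minseglen` long, since a segment fitted
# to such a run has an estimated variance of zero.
warn_on_constant_runs <- function(x, minseglen = 2) {
  runs <- rle(x)
  has_run <- any(runs$lengths >= minseglen)
  if (has_run) {
    warning("data contain runs of ", minseglen, " or more identical values; ",
            "a segment fitted to such a run has zero variance")
  }
  invisible(has_run)
}

# Example use before fitting:
# warn_on_constant_runs(c(0, 0, 4, 5))   # warns
# warn_on_constant_runs(c(1, 2, 3))      # silent
```

Returning the flag invisibly lets a caller branch on it, for example to add jitter or to raise `minseglen`, which keeps the decision with the user as suggested above.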

tdhock commented 2 years ago

I agree that a warning would be a step in the right direction.