rkillick / changepoint

A place for the development version of the changepoint package on CRAN.
128 stars 33 forks source link

dependence on scaling of data #26

Open stephens999 opened 6 years ago

stephens999 commented 6 years ago

I noticed that the scaling of the data matters, which seems undesirable (and unnecessary).

For example:

set.seed(51)
true_mean = rep(c(-0.2,0.1,1,-0.5,0.2,-0.5,0.1,-0.2),c(137,87,17,49,29,52,87,42))
genomdat = list(x = rnorm(500, sd=0.2) + true_mean, true_mean=true_mean)

The cpt.mean default does not find any changepoints:

genomdat.cp = cpt.mean(genomdat$x,method="PELT")
plot(genomdat.cp)

But if we multiply the data by 10 we find many changepoints.

genomdat.cp = cpt.mean(10*genomdat$x,method="PELT")
plot(genomdat.cp)

I speculate that perhaps the cost function (log-likelihood) implicitly assumes the variance is 1?

Incidentally to this, while digging around the code to see if I could understand the issue, I noticed that some places in the code use "norm.mean" whereas others use "mean.norm". I'm not sure that was intended?

Matthews-MacBook-Air-2:changepoint stephens$ grep norm.mean src/*
src/BinSeg_one_func_minseglen.c:     char **cost_func; //Descibe the cost function used i.e. norm.mean.cost (change in mean in normal distributed data)  
src/BinSeg_one_func_minseglen.c:  {"norm.mean", mll_mean},
src/BinSeg_one_func_minseglen.c:  {"norm.meanvar", mll_meanvar},
Matthews-MacBook-Air-2:changepoint stephens$ grep mean.norm src/*
src/BinSeg_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm")==0){
src/BinSeg_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm.mbic")==0){
src/PELT_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm")==0){
src/PELT_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm.mbic")==0){
rkillick commented 5 years ago

I agree that this is an undesirable "feature" but I haven't decided on a suitable "default" behaviour for estimating the variance in the presence of a mean change. If you have any suggestions, please let me know.

stephens999 commented 5 years ago

i'd suggest 0.5*median(abs(diff(y))/0.6745)^2 which is based on the fact that yt-y{t+1} \sim N(0,2sigma^2) if there is no difference in mean between t and t+1 and E(median(abs(N(0,sigma^2)))) = 0.6745 sigma

rkillick commented 5 years ago

Thanks for the suggestion, i'll see how it behaves on different simulated data. The best one i've found at the moment is a sample variance on a rolling window of length 30 and then taking the median of those. But this is undesirable when you have long time series.