twitter / BreakoutDetection

Breakout Detection via Robust E-Statistics
GNU General Public License v2.0

'Interesting' behaviour in EDM-X when data has a small SD #22

Open richdrich opened 8 years ago

richdrich commented 8 years ago

I found some unusual behavior as the standard deviation of some test data (on either side of a step change) drops.

When the sd is less than 1, the detection of the change becomes inaccurate - in a very defined manner. [EDIT: I'd note that the '1' is a big coincidence - the knee changes as the data range changes, as you might expect]

See the code below. My data actually changes at point 500. When the SD is above that threshold, EDM-X finds the change to within two intervals; below it, the result is out by 50 intervals.

I'd be interested in any comments on this...

library(BreakoutDetection)
library(zoo)  # provides zoo(), used below to build the series

# Try EDM-X on SDs over a (log) range
logSds <- seq(from=-0.2, to=0.2, by=.05)
sds <- 10 ^ logSds
errs <- numeric(length(sds))

for(i in seq_along(sds)) {
  set.seed(123)
  # construct datasets: the mean steps from 100 to 110 after observation 500
  s1 <- zoo(rnorm(500,mean=100,sd=sds[i]), seq.POSIXt(as.POSIXlt("2016-01-01"), by=3600,length.out=500))
  s2 <- zoo(rnorm(400,mean=110,sd=sds[i]), seq.POSIXt(as.POSIXlt("2016-01-21 20:00:00"), by=3600,length.out=400))

  st <- rbind(s1, s2)

  zdata <- data.frame(timestamp=time(st), count=as.vector(st))

  br <- breakout(zdata,min.size=100, method='amoc', plot=TRUE)

  # absolute error of the detected location against the true change at 500
  errs[i] <- abs(br$loc - 500)
}

plot(sds, errs)

richdrich commented 8 years ago

Further to this, I think the issue arises when the SD and the number of observations are such that the two medians tend to 1 and 0 for all data ranges (values of tau2 and tau1). That leads to the medians being ignored and the algorithm converging at a point determined by tau2 and tau1 (and hence the sizes of the two datasets), which isn't the actual breakout. Or something like that.
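
A rough way to poke at this hunch without touching the package internals is below. It is only a sketch: it assumes the data is min-max scaled to [0, 1] and that the statistic is built from medians of pairwise absolute differences within and between the two segments. I haven't checked that against the source. `median_dists` and its `tau` argument are made-up names for illustration, and the split is fixed at the true change point.

```r
# Sketch only: NOT the package's internals. The [0, 1] rescaling and the
# median-of-pairwise-distances summaries are assumptions.
median_dists <- function(sd_val, n1 = 500, n2 = 400, tau = n1) {
  set.seed(123)
  x <- c(rnorm(n1, mean = 100, sd = sd_val),
         rnorm(n2, mean = 110, sd = sd_val))
  z <- (x - min(x)) / (max(x) - min(x))   # assumed rescaling to [0, 1]
  left  <- z[seq_len(tau)]
  right <- z[(tau + 1):length(z)]
  c(sd           = sd_val,
    within_left  = median(as.vector(dist(left))),          # pairs inside the left segment
    within_right = median(as.vector(dist(right))),         # pairs inside the right segment
    between      = median(abs(outer(left, right, "-"))))   # pairs across the split
}

# One row per SD, from very small to larger than the step itself
res <- t(sapply(10 ^ c(-2, -1, 0, 1), median_dists))
print(round(res, 3))
```

If the hunch is right, the `between` column should creep towards 1 and the two `within` columns towards 0 as the SD shrinks relative to the step of 10, which would leave the medians with little to say about where the split should go.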

I'm thinking this won't be too much of a problem with real data (I found it with a naive test case) but would be interested in any comments?