robjhyndman / tsfeatures

Time series features
http://pkg.robjhyndman.com/tsfeatures

`sampenc` implementation #27

Open kohleth opened 5 years ago

kohleth commented 5 years ago

Hi,

I am trying to understand the logic behind `sampenc`. I figure it is a modification of Douglas Lake's code (is that right?). But there seem to be two issues:

  1. The line `p <- A[2]/B[1]` should, I think, be `p <- A[M]/B[M-1]`.

  2. I don't quite get Lake's code (is it correct?). I tried to follow his argument on https://www.physionet.org/content/sampen/1.0.0/, especially:

If a particular run ends up being of length 4, for example, then that means that 1 is added to the count for template matches of length 4. In addition, there are 2 template matches of length 3, 3 of length 2, and 4 of length 1 that need to be added to the corresponding counts.

But I couldn't verify this behaviour using some simple simulated series.
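(For concreteness, here is how I read that decomposition in isolation, with two made-up length-4 segments standing in for a run; it only spells out the counting claim itself, not what the code accumulates on a full series.)

## sanity check of the quoted decomposition only: if two length-4 segments
## match elementwise within r, brute-force counting of aligned sub-template
## matches should give 1 of length 4, 2 of length 3, 3 of length 2, 4 of length 1
r=0.3
a=c(0.10,0.50,0.30,0.70)  ## made-up stand-in for y[i:(i+3)]
b=c(0.15,0.45,0.35,0.65)  ## made-up stand-in for y[j:(j+3)]
stopifnot(all(abs(a-b)<r))  ## the "run" really has length 4
sapply(4:1,function(m)
  sum(sapply(1:(4-m+1),function(k)all(abs(a[k:(k+m-1)]-b[k:(k+m-1)])<r))))
# [1] 1 2 3 4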

So I

  1. re-implemented Richman and Moorman's original algorithm literally,
  2. used `pracma::sample_entropy`,
  3. re-implemented Lake's MATLAB code literally, and
  4. modified `tsfeatures::sampenc` according to point 1 above,

and did a comparison:

my_sam_en=function(ts,M,r){  ## Richman and Moorman, literally
  N=length(ts)
  subts=function(.ts,l,exclude_last=T){
    N=length(.ts)
    if(exclude_last)return(lapply(1:(N-l),function(i).ts[i:(i+l-1)]))
    lapply(1:(N-l+1),function(i).ts[i:(i+l-1)])
  }
  ## all length-M subsequences starting at 1..N-M (the last starting point is dropped),
  ## and all length-(M+1) subsequences starting at 1..N-M
  x_M=subts(ts,M,exclude_last = T)
  x_Mp1=subts(ts,M+1,exclude_last=F)

  count_r=function(xx){
    ## count ordered pairs of distinct templates whose Chebyshev distance is < r
    ## (the diagonal of self-matches is subtracted off; the 1e-10 guards against
    ## distances numerically equal to r)
    d=Vectorize(function(a,b)max(abs(xx[[a]]-xx[[b]])),c('a','b'))
    N=length(xx)
    O=outer(1:N,1:N,d)
    sum(O<(r-1e-10))-nrow(O)
  }
  C_M_r=count_r(x_M)
  C_Mp1_r=count_r(x_Mp1)

  -log(C_Mp1_r/C_M_r)  
}

modified_sampenc=function (y, M = 6, r = 0.3)   ## tsfeatures::sampenc with the last-line fix from point 1
{
  N <- length(y)
  lastrun <- numeric(N)
  run <- numeric(N)
  A <- numeric(M)
  B <- numeric(M)
  for (i in 1:(N - 1)) {
    y1 <- y[i]
    for (jj in 1:(N - i)) {
      j <- i + jj
      if (isTRUE(abs(y[j] - y1) < r)) {
        run[jj] <- lastrun[jj] + 1
        M1 <- min(M, run[jj])
        A[1:M1] <- A[1:M1] + 1
        if (j < N) 
          B[1:M1] <- B[1:M1] + 1
      }
      else {
        run[jj] <- 0
      }
    }
    for (j in 1:(N - i)) {
      lastrun[j] <- run[j]
    }
  }
  p <- A[M]/B[M-1]   ## changed from the original p <- A[2]/B[1]
  e <- -log(p)
  return(e)
}
lake=function(y,M,r){  ## literal R translation of Lake's matlab code
  n=length(y);
  lastrun=rep(0,n);
  run=rep(0,n);
  A=rep(0,M);
  B=rep(0,M);
  p=rep(0,M);
  e=rep(0,M);
  for(i in 1:(n-1)){
    nj=n-i;
    y1=y[i];
    for(jj in 1:nj){
      j=jj+i;      
      if(abs(y[j]-y1)<r){
        run[jj]=lastrun[jj]+1;
        M1=min(M,run[jj]);
        for(m in 1:M1){
          A[m]=A[m]+1;
          if(j<n)B[m]=B[m]+1;
        }           
      }else{
        run[jj]=0;
      }
    }
    for(j in 1:nj){
      lastrun[j]=run[j];    
    }
  }
  N=n*(n-1)/2;
  B=c(N,B[1:(M-1)]);
  p=A/B;
  e=-log(p);
  e[length(e)] ## only return the last value
}
y=sin(1:100)
M=2;r=0.3
c(pracma::sample_entropy(y,edim=M,r=r),
  my_sam_en(y,M,r),
  modified_sampenc(y,M,r),
  lake(y,M,r),
  tsfeatures::sampenc(y,M,r))
# [1] 0.1006652 0.1006652 0.8065081 0.8065081 0.8065081

M=3;r=0.2
c(pracma::sample_entropy(y,edim=M,r=r),
  my_sam_en(y,M,r),
  modified_sampenc(y,M,r),
  lake(y,M,r),
  tsfeatures::sampenc(y,M,r))
# [1] 0.0000000 0.0000000 0.1459539 0.1459539 0.9808293

So pracma::sample_entropy (and my translation of the original paper) more or less agree with each other, while strongly disagreeing with Lake's code (and hence with tsfeatures::sampenc).

But again, I couldn't follow Lake's algorithm at all.
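In case it helps whoever looks at this: a brute-force pair counter along the following lines (brute_count is just a name I made up here) might be a way to pin down what the A and B vectors are supposed to count, e.g. by making lake() also return A and B and comparing. My reading, which may well be wrong, is that A[m] ought to be the number of distinct pairs of length-m templates within r of each other.

## brute-force count of matching template pairs: pairs (i,j) with i<j such that
## the length-m templates y[i:(i+m-1)] and y[j:(j+m-1)] have Chebyshev distance < r
## (each unordered pair counted once, self-matches excluded)
brute_count=function(y,m,r){
  n=length(y)
  starts=1:(n-m+1)   ## all starting positions of length-m templates
  cnt=0
  for(i in starts){
    for(j in starts){
      if(j>i && max(abs(y[i:(i+m-1)]-y[j:(j+m-1)]))<r) cnt=cnt+1
    }
  }
  cnt
}
## e.g. sapply(1:4,function(m)brute_count(y,m,r)) could then be compared against
## the A accumulated inside lake()/sampenc (after modifying them to return it)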

wgova commented 2 years ago

@robjhyndman - is this repo/package still being maintained?

Are the features computed here equivalent to those in FEASTS - or are the two slowly diverging to make way for multi-series computations?

robjhyndman commented 2 years ago

tsfeatures is still being maintained, at least fixing bugs, etc., and we will continue to ensure it stays on CRAN. The features in feasts may diverge as that is under more active development.