twitter / BreakoutDetection

Breakout Detection via Robust E-Statistics
GNU General Public License v2.0
753 stars 187 forks source link

Questions about edm-multi #21

Open eric-bunch opened 8 years ago

eric-bunch commented 8 years ago

I've been going through the code for edm-multi.cpp, and I some general confusion about the algorithm being carried out. I'll keep my questions limited at first so that perhaps the answers to them can alleviate my confusion on the rest.

  1. Is the statistic used in the algorithm to detect multiple breakouts (in edm-multi.cpp) to measure the significance of a breakout at a certain point the same as in that for detection of a at most one breakout (i.e. in edmTail.cpp)? I feel like I understand the process used in edmTail, especially thanks to the paper written introducing the breakout detection algorithm. But it looks like edm-multi measures something slightly different--as best as I can tell it's looking for a shift in the median with some sort of optional penalization.
  2. Is it possible to get some explanation as to what is going on in lines 73-78 of edm-multi.cpp? I think one of the main things confusing me about this is the F[t] term in the definition of tmp. It seems like this would have an unwanted accumulative effect on the statistic.
  3. Am I correct in saying that when *G = Quadratic, breakouts that occur earlier in the time series are favored more than those that occur later? It seems like the more breakouts that have been observed, the more this term will penalize the tmp statistic.
putnam120 commented 8 years ago

Thanks for your interest in our work. Below are the answers to your questions. Don't hesitate to ask followup questions if you have any, or any other questions for that matters.

  1. In the method to detect multiple breakouts there is no significance test unfortunately. This just tries to maximize the divergence measure between adjacent segments. While using penalization to fight against overfitting/oversegmentation.
  2. The point of F[s] is to store the value of the optimal segmentation of the first s observations. Thus the best way to segment the first t observations is to assume the last breakpoint is a s<t and then add on the change caused by the new segment. You then iterate over all possible values of s and keep the best one.
  3. No having *G = Quadratic should not favor earlier breakouts. This choice will more negatively penalize the use of additional breakouts, as compared to using the linear penalization option. One thing to remember is that the breakpoints used to segment the first t observations isn't necessarily a subset of the breakpoints used to segment the first t+n (n>0) observations.

Sorry for the delayed response, but hopefully I was able to answer your questions.

On Wed, Feb 17, 2016 at 1:31 PM, Eric Bunch notifications@github.com wrote:

I've been going through the code for edm-multi.cpp, and I some general confusion about the algorithm being carried out. I'll keep my questions limited at first so that perhaps the answers to them can alleviate my confusion on the rest.

1.

Is the statistic used in the algorithm to detect multiple breakouts (in edm-multi.cpp) to measure the significance of a breakout at a certain point the same as in that for detection of a at most one breakout (i.e. in edmTail.cpp)? I feel like I understand the process used in edmTail, especially thanks to the paper written introducing the breakout detection algorithm. But it looks like edm-multi measures something slightly different--as best as I can tell it's looking for a shift in the median with some sort of optional penalization. 2.

Is it possible to get some explanation as to what is going on in lines 73-78 of edm-multi.cpp? I think one of the main things confusing me about this is the F[t] term in the definition of tmp. It seems like this would have an unwanted accumulative effect on the statistic. 3.

Am I correct in saying that when *G = Quadratic, breakouts that occur earlier in the time series are favored more than those that occur later? It seems like the more breakouts that have been observed, the more this term will penalize the tmp statistic.

— Reply to this email directly or view it on GitHub https://github.com/twitter/BreakoutDetection/issues/21.

eric-bunch commented 8 years ago

Thanks for the response. With regards to #1, that makes more sense with how I was understanding the algorithm. So there is no way to make use of the significance measure described in the paper for multiple breakouts? Would it be possible to determine the bestStat value for each point and each segmentation, and then do some sort of filtering process based on the min_size and some sort of tolerance parameter? Or is that approach too naive?

I think the answer to 2 makes sense. I think maybe I'm overthinking it, but I'll give it some time to digest.

So the penalization term beta*G(number[t]) will only be nonzero when there have been breakpoints detected in the segment associated with t, correct?

Another question I'm having is will the left segment ever be the union of two disjoint intervals? The reason I'm asking that is on lines 47 and 48 the values {Z[prev[min_size - 1]], ..., Z[min_size - 2]} are inserted into the left tree. But since the value of prev[s] is only ever changed on line 78, and s is iterating from 2*min_size ton, then prev[min_size - 1] never gets altered. So prev[min_size - 1] will always be 0. Then on line 54, the left tree is filled further so that in total it has {Z[0], ...., Z[t-1]}. Then going to lines 61 & 62, if prev[t] > prev[t-1], the elements {Z[prev[t-1]],...,Z[prev[t]]} are removed from the left tree, leaving {Z[0],...,Z[prev[t-1]-1], Z[prev[t]+1,...,Z[t-1]]} in the left tree. This seems odd to me; I would expect that something like only {Z[prev[t]]+1,...,Z[t-1]} in the left tree.

By the way, I really enjoyed reading through the paper (I'm assuming you are the author Nicholas on that), and think the method is really cool. And the package you've written is great. I don't know how the multi version works quite yet, but it works really well!

putnam120 commented 8 years ago

I'm so sorry about the delay on getting back to your latest questions.

For your questions about beta*G(number[t]) it makes sense that the function is equal to zero when number[t] = 0. However, this doesn't have to be the case. Just as long a G( ) is an increasing function it's use should still make sense.

Actually, Z[0], Z[1], ... , Z[prev[t]] - 1 are removed from the left tree. You are correct in that prev[min_size - 1] will always be zero but that's fine. I think you forgot that the first time this happens that prev[t-1] =

  1. And after that it will be updated. Note that when iterating the value of t we always start at min_size so we avoid the problem that you mentioned.

Hope that helps. Once again sorry for the delay.

On Fri, Feb 19, 2016 at 2:55 PM, Eric Bunch notifications@github.com wrote:

Thanks for the response. With regards to #1 https://github.com/twitter/BreakoutDetection/issues/1, that makes more sense with how I was understanding the algorithm. So there is no way to make use of the significance measure described in the paper for multiple breakouts? Would it be possible to determine the bestStat value for each point and each segmentation, and then do some sort of filtering process based on the min_size and some sort of tolerance parameter? Or is that approach too naive?

I think the answer to 2 makes sense. I think maybe I'm overthinking it, but I'll give it some time to digest.

So the penalization term beta*G(number[t]) will only be nonzero when there have been breakpoints detected in the segment associated with t, correct?

Another question I'm having is will the left segment ever be the union of two disjoint intervals? The reason I'm asking that is on lines 47 and 48 the values {Z[prev[min_size - 1]], ..., Z[min_size - 2]} are inserted into the left tree. But since the value of prev[s] is only ever changed on line 78, and s is iterating from 2*min_size ton, then prev[min_size - 1] never gets altered. So prev[min_size - 1] will always be 0. Then on line 54, the left tree is filled further so that in total it has {Z[0], ...., Z[t-1]}. Then going to lines 61 & 62, if prev[t] > prev[t-1], the elements {Z[prev[t-1]],...,Z[prev[t]]} are removed from the left tree, leaving {Z[0],...,Z[prev[t-1]-1], Z[prev[t]+1,...,Z[t-1]]} in the left tree. This seems odd to me; I would expect that something like only {Z[prev[t]]+1,...,Z[t-1]} in the left tree.

By the way, I really enjoyed reading through the paper (I'm assuming you are the author Nicholas on that), and think the method is really cool. And the package you've written is great. I don't know how the multi version works quite yet, but it works really well!

— Reply to this email directly or view it on GitHub https://github.com/twitter/BreakoutDetection/issues/21#issuecomment-186441200 .