This is partially done: it includes all the changes for the multivariate Energy Statistics calculations, but it does not include the changes to make E-Divisive Means multivariate.
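For context, a minimal sketch of what a multivariate energy-statistic calculation looks like (this is only illustrative and not the code in this PR; the `energy_distance` helper name is hypothetical):

```python
import numpy as np
from scipy.spatial.distance import cdist


def energy_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Sample energy distance between two multivariate samples.

    x has shape (m, d) and y has shape (n, d); each row is one observation.
    """
    between = cdist(x, y).mean()    # E||X - Y||
    within_x = cdist(x, x).mean()   # E||X - X'||
    within_y = cdist(y, y).mean()   # E||Y - Y'||
    return 2 * between - within_x - within_y


rng = np.random.default_rng(1234)
a = rng.normal(0.0, 1.0, size=(50, 3))
b = rng.normal(0.5, 1.0, size=(60, 3))
print(energy_distance(a, b))
```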
Observations upon investigation of the different change points:

1. There are tests in `test_detection` (`test_large`, `test_very_large`, and `test_huge`) that are incorrect. These are the ones marked `@pytest.mark.slow`, which don't get run unless explicitly specified, and that is probably why this was never caught before the code was merged. (`test_large` produces 34 change points instead of the 31 specified in the test file, `test_very_large` produces 69 instead of the 61 specified in the test file, and `test_huge` produces 138 instead of the 132 specified in the test file.)
2. `np.random.shuffle` and `random.shuffle` produce different shuffle orders for the same seed (see the snippet after this list). Changed `np.random.shuffle` to `random.shuffle` in the new implementation and ran it, but the results observed were still not the same for large time series.
3. Changed the `qhat` formula to the `t` formula in the old implementation and ran again. The order of the change points found for really large time series becomes different from the old implementation. For huge time series the number of change points found varies as well.
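A quick illustration of observation 2 (this snippet is only for illustration and is not part of the change):

```python
import random
import numpy as np

data_py = list(range(10))
data_np = np.arange(10)

random.seed(1234)
random.shuffle(data_py)      # Python's stdlib shuffle (Mersenne Twister stream)

np.random.seed(1234)
np.random.shuffle(data_np)   # NumPy's shuffle, seeded with the same value

# The two orders differ: the generators and shuffle algorithms are not the
# same, so seeding them with the same number does not make them interchangeable.
print(data_py)
print(data_np)
```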
So we can conclude that the changes in the number of change points detected are due to 1 and 2 above and the change from the `qhat` to the `t-stat` formula.

If we want the same change points to be generated as with the older implementation for MOST of the time series (we cannot guarantee this for all time series because of the change from `qhat` to `t-stat`; the differences are mostly seen in large time series), we need to:

- Replace `np.random.shuffle` with `random.shuffle`
- Use seed 1234 for the change point calculation in e-divisive

An example of this change:
For the `qhat`-based calculation, the change points observed for `very_large.json` are:
[2364, 2318, 25, 64, 114, 173, 198, 237, 287, 346, 371, 410, 460, 519, 544, 583, 633, 692, 717, 756, 806, 865, 890, 929, 979, 1038, 1063, 1102, 1152, 1211, 1236, 2190, 2145, 2249, 2274, 2017, 1972, 2076, 2101, 1844, 1799, 1903, 1928, 1671, 1626, 1730, 1755, 1498, 1453, 1557, 1582, 1325, 1279, 1384, 1409, 69, 242, 415, 588, 761, 934, 1107, 1448, 1621, 1794, 1967, 2140, 2313, 1275]
For the `t-stat`-based calculation this changed to:
[2364, 2318, 25, 2190, 2145, 2249, 2274, 2017, 1972, 2076, 2101, 1844, 1799, 1903, 1928, 1671, 1626, 1730, 1755, 1498, 1453, 1557, 1582, 1325, 1280, 1384, 1409, 1152, 1107, 1211, 1236, 979, 934, 1038, 1063, 806, 761, 865, 890, 633, 588, 692, 717, 460, 415, 519, 544, 287, 242, 346, 371, 114, 68, 173, 198, 237, 410, 583, 756, 929, 1102, 1275, 1448, 1621, 1794, 1967, 2140, 2313]
The number of change points dropped from 69 to 68, and their order also changed, as shown above.
evergreen merge
Our initial approach to doing `get_stats_and_probs` seemed fast enough for smaller samples:

In the image above, the first function uses the combined distance matrix (the new/optimized method) for the permutation test, and the second one is the way we initially planned to do it (shuffling the combined inputs).
However, upon increasing the size of the arrays, our initial approach was extremely slow:

![image](https://user-images.githubusercontent.com/72042726/110661356-7bf82380-8192-11eb-9b17-c757a95c3d66.png)
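For a rough idea of why that approach gets slow, here is a minimal sketch of the shuffle-the-combined-inputs strategy; it assumes the hypothetical `energy_distance` helper sketched earlier, and `permutation_test_naive` is an illustrative name, not code from this PR:

```python
import random

import numpy as np


def permutation_test_naive(x, y, permutations=100, seed=1234):
    """Shuffle the pooled observations and recompute the statistic each time."""
    observed = energy_distance(x, y)           # helper sketched earlier
    combined = np.concatenate([x, y])
    m = len(x)
    rng = random.Random(seed)
    indices = list(range(len(combined)))
    extreme = 0
    for _ in range(permutations):
        rng.shuffle(indices)                   # reshuffle the pooled sample
        permuted = combined[indices]
        # Every iteration recomputes all pairwise distances inside
        # energy_distance, which is what makes this slow for large inputs.
        if energy_distance(permuted[:m], permuted[m:]) >= observed:
            extreme += 1
    return extreme / permutations              # permutation p-value estimate
```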
Hence, we made some changes to the permutation test implementation to optimize the calculations. A description of this method is in the comments of the `get_statistics_and_probabilities` method.
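Roughly, the idea is to compute the pooled pairwise-distance matrix once and then only permute indices into it, so each permutation re-slices the matrix instead of recomputing all distances. A minimal sketch of that idea (illustrative only, not the actual `get_statistics_and_probabilities` implementation):

```python
import random

import numpy as np
from scipy.spatial.distance import cdist


def permutation_test_precomputed(x, y, permutations=100, seed=1234):
    """Permutation test that reuses a single combined distance matrix."""
    rng = random.Random(seed)
    combined = np.concatenate([x, y])
    m, n = len(x), len(y)
    # Pay the full pairwise-distance cost exactly once.
    distances = cdist(combined, combined)

    def statistic(idx_x, idx_y):
        between = distances[np.ix_(idx_x, idx_y)].mean()
        within_x = distances[np.ix_(idx_x, idx_x)].mean()
        within_y = distances[np.ix_(idx_y, idx_y)].mean()
        return 2 * between - within_x - within_y

    observed = statistic(np.arange(m), np.arange(m, m + n))
    indices = list(range(m + n))
    extreme = 0
    for _ in range(permutations):
        rng.shuffle(indices)                   # only the indices are shuffled
        idx_x = np.array(indices[:m])
        idx_y = np.array(indices[m:])
        if statistic(idx_x, idx_y) >= observed:
            extreme += 1
    return extreme / permutations
```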