mongodb / signal-processing-algorithms

Apache License 2.0

DAG-681: Add Energy Statistic Homogeneity Calculator #16

Closed: anjani-bhat closed this 3 years ago

anjani-bhat commented 3 years ago

Our initial approach to get_stats_and_probs seemed fast enough for smaller samples:

[Screenshot (2021-03-10): timing comparison of the two implementations on smaller samples]

In the image above, the first function is the permutation test based on the combined distance matrix (the new, optimized method), and the second is the approach we initially planned (shuffling the combined inputs).

However, upon increasing the size of the arrays, our initial approach was extremely slow:

[Screenshot: timing results for larger arrays]

Hence, I worked on some changes to the permutation test implementation to optimize the calculations. The method is described in the comments of get_statistics_and_probabilities.
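To make the difference concrete, here is a minimal, self-contained sketch of the two approaches. This is my own illustration, not the code in this PR: the function names `naive_permutation_test` and `matrix_permutation_test`, and the use of SciPy's `cdist` for the pairwise distances, are assumptions for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist


def energy_statistic(xy, xx, yy):
    """E-statistic from the three blocks of a pairwise distance matrix."""
    return 2 * xy.mean() - xx.mean() - yy.mean()


def naive_permutation_test(x, y, permutations=1000, rng=None):
    """Shuffle the combined samples and rebuild the distance matrices each time.

    x, y: (n_samples, n_features) arrays.
    """
    rng = rng or np.random.default_rng()
    observed = energy_statistic(cdist(x, y), cdist(x, x), cdist(y, y))
    combined = np.concatenate([x, y])
    n = len(x)
    count = 0
    for _ in range(permutations):
        rng.shuffle(combined)                   # cheap shuffle ...
        px, py = combined[:n], combined[n:]
        stat = energy_statistic(cdist(px, py),  # ... but the O(N^2) distance
                                cdist(px, px),  #     recomputation happens on
                                cdist(py, py))  #     every single permutation
        count += stat >= observed
    return observed, count / permutations


def matrix_permutation_test(x, y, permutations=1000, rng=None):
    """Build the combined distance matrix once; permutations only reindex it."""
    rng = rng or np.random.default_rng()
    combined = np.concatenate([x, y])
    dist = cdist(combined, combined)            # one O(N^2) computation, up front
    n = len(x)
    idx = np.arange(len(combined))
    observed = energy_statistic(dist[:n, n:], dist[:n, :n], dist[n:, n:])
    count = 0
    for _ in range(permutations):
        rng.shuffle(idx)                        # O(N) per permutation
        a, b = idx[:n], idx[n:]
        stat = energy_statistic(dist[np.ix_(a, b)],
                                dist[np.ix_(a, a)],
                                dist[np.ix_(b, b)])
        count += stat >= observed
    return observed, count / permutations
```

Both take (n_samples, n_features) arrays. The point is that the expensive pairwise-distance computation happens once in the second version, while each permutation only permutes index labels into the precomputed matrix, which is why only that version stays fast as the arrays grow.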

anjani-bhat commented 3 years ago

This is partially done: it includes all the changes for multivariate Energy Statistics calculations, but it does not include the changes needed to make E-Divisive Means multivariate.

anjani-bhat commented 3 years ago

Observations upon investigation of different change points:

  1. There are tests currently in master (test_detection > test_large, test_very_large and test_huge) that are incorrect. They are marked @pytest.mark.slow and do not run unless explicitly requested, which is probably why this was never caught before the code was merged. (test_large produces 34 change points instead of the 31 specified in the test file, test_very_large produces 69 instead of 61, and test_huge produces 138 instead of 132.)
  2. np.random.shuffle and random.shuffle produce different shuffle orders for the same seed (see the snippet after this list).
  3. To validate 2, I changed np.random.shuffle to random.shuffle in the new implementation and ran it, but the observed results were still not the same for large time series.
  4. To further investigate 3, I changed the qhat formula to the t-stat formula in the old implementation and ran it again. The order of the change points found for really large time series then differed from the old implementation, and for huge time series the number of change points found varied as well. So we can conclude that the changes in the number of detected change points are due to 1 and 2 above, plus the change from the qhat to the t-stat formula.
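Observation 2 is easy to reproduce with a standalone snippet (not code from this PR):

```python
# Even with the same seed, the stdlib and NumPy shufflers draw from their
# generators differently, so the resulting orders differ.
import random
import numpy as np

seed = 1234

data_py = list(range(10))
random.seed(seed)
random.shuffle(data_py)

data_np = np.arange(10)
np.random.seed(seed)
np.random.shuffle(data_np)

print(data_py)           # one ordering
print(data_np.tolist())  # a different ordering, despite the identical seed
```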

If we want to generate the same change points as the older implementation for MOST time series (we cannot guarantee this for all time series because of the change from qhat to t-stat; the differences are mostly seen in large time series), we can:

  1. Replace np.random.shuffle with random.shuffle.
  2. Use the seed 1234 for change point calculation in edivisive (a minimal sketch follows below).

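For illustration, here is what that tweak could look like. The helper `shuffled_index_orders` is hypothetical and not part of this PR; it only shows the combination of a fixed seed of 1234 with stdlib random.shuffle.

```python
import random


def shuffled_index_orders(n, permutations, seed=1234):
    """Yield index permutations using stdlib random.shuffle with a fixed seed,
    so the permuted orders match what the old implementation produced."""
    random.seed(seed)
    indices = list(range(n))
    for _ in range(permutations):
        random.shuffle(indices)
        yield list(indices)
```

Each yielded order would then be used to reindex the combined distance matrix, as in the earlier sketch.
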
anjani-bhat commented 3 years ago

An example of this change:

For the qhat-based calculation, the change points observed for very_large.json were:

[2364, 2318, 25, 64, 114, 173, 198, 237, 287, 346, 371, 410, 460, 519, 544, 583, 633, 692, 717, 756, 806, 865, 890, 929, 979, 1038, 1063, 1102, 1152, 1211, 1236, 2190, 2145, 2249, 2274, 2017, 1972, 2076, 2101, 1844, 1799, 1903, 1928, 1671, 1626, 1730, 1755, 1498, 1453, 1557, 1582, 1325, 1279, 1384, 1409, 69, 242, 415, 588, 761, 934, 1107, 1448, 1621, 1794, 1967, 2140, 2313, 1275]

For the t-stat-based calculation this changed to:

[2364, 2318, 25, 2190, 2145, 2249, 2274, 2017, 1972, 2076, 2101, 1844, 1799, 1903, 1928, 1671, 1626, 1730, 1755, 1498, 1453, 1557, 1582, 1325, 1280, 1384, 1409, 1152, 1107, 1211, 1236, 979, 934, 1038, 1063, 806, 761, 865, 890, 633, 588, 692, 717, 460, 415, 519, 544, 287, 242, 346, 371, 114, 68, 173, 198, 237, 410, 583, 756, 929, 1102, 1275, 1448, 1621, 1794, 1967, 2140, 2313]

The length was reduced from 69 to 68, and the order of the change points also changed, as shown above.

anjani-bhat commented 3 years ago

evergreen merge