linkedin / luminol

Anomaly Detection and Correlation library
Apache License 2.0
1.19k stars 217 forks source link

Passed max_shift_seconds in CrossCorrelator not coverted to milliseconds #38

Open earthgecko opened 6 years ago

earthgecko commented 6 years ago

In src/luminol/algorithms/correlator_algorithms/cross_correlator.py if max_shift_seconds is passed, it is not converted to milliseconds like the DEFAULT_ALLOWED_SHIFT_SECONDS

earthgecko commented 6 years ago

Are milliseconds needed? Are milliseconds breaking the Correlator?

On further evaluation it appears as if the specific use of milliseconds in the CrossCorrelator has an undesired effect in the subsequent _find_allowed_shift and _find_first_bigger which is using residual_timestamps (which are not necessarily in milliseconds).

Perhaps I am missing something. The documentation refers to epoch seconds, there is no mention of milliseconds another that in these functions so far as I can tell. Although there is the `def timestamps_ms it does not appear to be called anywhere. Therefore as it stands, it appears that the the correlator is always going to shift epoch timestamps far too many positions.

Example:

timestamps = [
    1515337587,
    1515337647,
    1515337707,
    1515337767,
    1515337827,
    1515337887,
    1515337947,
    1515338007,
    1515338067,
    1515338127,
    1515338187]

# def _find_allowed_shift(self, timestamps):
init_ts = timestamps[0]
residual_timestamps = [ts - init_ts for ts in timestamps]
n = len(residual_timestamps)

residual_timestamps
>> [0, 60, 120, 180, 240, 300, 360, 420, 480, 540, 600]

#def _find_first_bigger(self, timestamps, target, lower_bound, upper_bound):
target = 60000  # default max_shift_milliseconds
lower_bound = 0
upper_bound = n
while lower_bound < upper_bound:
    pos = lower_bound + (upper_bound - lower_bound) / 2
    pos = int(pos)
    if residual_timestamps[pos] > target:
        upper_bound = pos
    else:
        lower_bound = pos + 1
pos
>> 10
# if you use target = 60 pos is 1

This is not the expected result or I am missing something?

If it is not the desired result, standardise everything to milliseconds or just have milliseconds as an argument if desired? I can imagine that milliseconds may be useful in some cases however the majority of current time series data is commonly going to be to second resolution, so as an argument and default being seconds would probably be convenient to most new users.

earthgecko commented 6 years ago

Updated the PR to remove milliseconds - https://github.com/linkedin/luminol/pull/39 - changed the use of milliseconds to seconds and removed reference to milliseconds as it is not really used and when it is used, it is used somewhat erroneously.