orenefraty commented 5 years ago

Does the implementation support in timeDecay? Meaning how much of recent past to consider when computing an anomaly? Same question about Attribution score (how much a dimension contributed to anomaly, where sum of all attributions equal to record anomaly score)

Also, regarding examples - can you please add an example of high-dimensional time-series with shingling?

Thank you!

nerooooooz commented 5 years ago

Same question on timeDecay...and I also wonder if this one is good for trending data because there are a lot of points with high codisp in my test for trending data (no anomalys)...

mdbartos commented 5 years ago

There are a lot of different ways one could implement time decay. Here's one way using batch mode. I just used a series of rolling gaussian pdfs to define the sampling frequency.

import numpy as np
import pandas as pd
import scipy.stats
import rrcf
from matplotlib import pyplot as plt
import seaborn as sns

# Read data
taxi = pd.read_csv('../resources/nyc_taxi.csv', index_col=0)
taxi.index = pd.to_datetime(taxi.index)
data = taxi['value'].astype(float).values
n = data.shape[0]

# Set tree parameters
num_trees = 200
shingle_size = 48
tree_size = 1000

# Use the "shingle" generator to create rolling window
points = rrcf.shingle(data, size=shingle_size)
points = np.vstack([point for point in points])
num_points = points.shape[0]

# Chunk data into segments
num_chunks = 40
chunk_size = num_points // num_chunks
chunks = (chunk * chunk_size for chunk in range(num_chunks))

forest = []
pdfs = []
for chunk in chunks:
    # Create probability density function for sample
    g = scipy.stats.norm(loc=chunk, scale=400).pdf(np.linspace(0, num_points, num_points))
    g *= (1/g.sum())
    # Sample data with probability density function defined by g
    ixs = np.random.choice(num_points, size=(num_points // tree_size, tree_size), 
                                             replace=False, p=g)
    # Create trees from sampled points
    trees = [rrcf.RCTree(points[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)
    # Store pdf for plotting purposes
    pdfs.append(g)

# Compute CoDisp
avg_codisp = pd.Series(0.0, index=np.arange(num_points))
index = np.zeros(num_points)
for tree in forest:
    codisp = pd.Series({leaf : tree.codisp(leaf)
                        for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp
    np.add.at(index, codisp.index.values, 1)
avg_codisp /= index
avg_codisp.index = taxi.iloc[(shingle_size - 1):].index

Plotting:

# Plot results
fig, ax = plt.subplots(3, figsize=(10, 9))
(taxi['value'] / 1000).plot(ax=ax[0], color='0.5', alpha=0.8)
avg_codisp.plot(ax=ax[2], color='#E8685D', alpha=0.8, label='RRCF')
for pdf in pdfs[1:]:
    pd.Series(pdf, index=avg_codisp.index).plot(ax=ax[1], c='0.6')    
ax[0].set_title('Taxi passenger timeseries')
ax[1].set_title('Rolling probability density functions')
ax[2].set_title('Anomaly score')
ax[0].set_ylabel('Thousands of passengers')
ax[1].set_ylabel('p(x)')
ax[2].set_ylabel('CoDisp')
ax[0].set_xlabel('')
ax[1].set_xlabel('')
ax[2].set_xlabel('')
plt.tight_layout()
plt.savefig('decay_example.png')

decay_example

vidbap commented 4 years ago

Awesome example, thank you @mdbartos

1. Doesn't the shingle size determine the time decay - how much of the past data to use defined by this rolling window? So, why would we do batching as proposed and advantages of doing a sampling on it?

2. Attribution Score - I would also be interested in adding this as a feature. AWS has a random cut forest implementation in Kinesis Stream Analytics with "explanation" that does exactly this, however, we want to use it in our datacenter to track operations anomalies.

Best regards.

Aubindb commented 4 years ago

Hello,

I would also be grateful for an example in high dimension. I managed to adapt the shingling to get multidimensional rolling windows I'm struggling to adapt the rest of the batch example.

Thank you!!

kLabUM / rrcf

timeDecay & High-dimensional shingling #58

1. Doesn't the shingle size determine the time decay - how much of the past data to use defined by this rolling window? So, why would we do batching as proposed and advantages of doing a sampling on it?

2. Attribution Score - I would also be interested in adding this as a feature. AWS has a random cut forest implementation in Kinesis Stream Analytics with "explanation" that does exactly this, however, we want to use it in our datacenter to track operations anomalies.