kLabUM / rrcf

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
https://klabum.github.io/rrcf/
MIT License
488 stars 111 forks source link

Is the threshold of the Anomaly score dynamic for the Streaming detection? #81

Open shihyulee opened 3 years ago

shihyulee commented 3 years ago

On the streaming data, is it necessary to design the dynamic threshold of the anomaly score to adapt over time as the input data changes?

Thank you.

yasirroni commented 3 years ago

In my case, static model with static threshold is prefered. Here is the code:

# Set tree parameters
NUM_TREES = 100 # around 100 trees or more is recommended
SHINGLE_SIZE = 10 # number of data treated as single data
NUM_SHINGLE = n_points - SHINGLE_SIZE + 1 # number of shingle
TREE_SIZE = NUM_SHINGLE # number of points in the tree for fifo, minutes in a day are 1440
SPROUT_PROBABILITY = 0.8 # the probability of leaf grow on single tree independently, lower means better generality

TREE_FULL_FLAG = False # flag to mark tree status

# Create a forest of empty trees
forest = [None] * NUM_TREES
for i in range(NUM_TREES):
    tree = rrcf.RCTree()
    forest[i] = tree

# Use the "shingle" generator to create rolling window
points = rrcf.shingle(X, size=SHINGLE_SIZE)

# transpose
def shingle_transpose(shingle):
    for val in shingle:
        yield val.T.flatten()

points = shingle_transpose(points)

# create a dict to store anomaly score of each point
avg_codisp = pd.Series(0.0, index=np.arange(NUM_SHINGLE))

# create a dict to store anomaly score of each point
avg_codisp = pd.Series(0.0, index=np.arange(NUM_SHINGLE))

# for each shingle...
for index, point in enumerate(points):
    # record number of tree that save the shingle
    freq = 0
    # for each tree in the forest...
    for tree in forest:
        # roll dice to determine point become a leave or not
        if np.random.random() <= SPROUT_PROBABILITY:
            # insert the new point into the tree
            tree.insert_point(point, index=index)

            # add 1 to freq record
            freq += 1

        # if tree already full
        if len(tree.leaves) == TREE_SIZE: 
            TREE_FULL_FLAG = True

    # if no tree save current shingle
    if freq == 0:
        # chose any tree
        tree = np.random.choice(forest)

        # insert the new point into the tree
        tree.insert_point(point, index=index)

        # if tree already full
        if len(tree.leaves) == TREE_SIZE: # if TREE_SIZE
            TREE_FULL_FLAG = True

    # if any tree is above permitted size...
    if TREE_FULL_FLAG:
        break

# Compute codisp
freq = np.zeros(index + 1)
for tree in forest:
    codisp = pd.Series({leaf : tree.codisp(leaf) for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp
    np.add.at(freq, codisp.index.values, 1)

for i in range(index + 1):
    if freq[i] >= 1:
        avg_codisp[i] = avg_codisp[i] / freq[i]

# After any tree is above permitted size and if points haven't empty, for each shingle...
for point in points:
    # Continue index counter
    index += 1

    codisp = 0
    freq = 0
    for tree in forest:
        # Roll dice to plant or not
        if np.random.random() <= SPROUT_PROBABILITY:
            # Check if leave can sprout
            if len(tree.leaves) == TREE_SIZE:
                # Drop the oldest point (FIFO)
                tree.forget_point(min(tree.leaves.keys()))

            # Insert the new point into the tree
            tree.insert_point(point, index=index)

            codisp += tree.codisp(index)
            freq += 1

    if freq == 0: # leaf not sprouted at any tree
        tree = np.random.choice(forest)
        # Check if leave can sprout
        if len(tree.leaves) == TREE_SIZE:
            # Drop the oldest point (FIFO)
            tree.forget_point(min(tree.leaves.keys()))

        # Insert the new point into the tree
        tree.insert_point(point, index=index)

        codisp += tree.codisp(index)
        freq += 1

    avg_codisp[index] = codisp / freq

plt.plot(avg_codisp)
plt.show()

Then, I will double (based on your case) the maximum value of train CoDisp as the threshold, and you can use insert and delete the new point in testing or deployment.