vatsalsharan / pidforest

Code for the PIDForest algorithm for anomaly detection
MIT License

No entropy in dimension: X #12

Open xuyxu opened 4 years ago

xuyxu commented 4 years ago

Hi, thank you for sharing the code. I encountered a problem when using PIDForest on the ionosphere dataset, which is publicly available (http://odds.cs.stonybrook.edu/ionosphere-dataset/).

Concretely, my setting is clean training data (i.e., no anomalies in X_train), and the goal is to evaluate an X_test that mixes inliers and anomalies.

After calling forest.fit(X_train), a warning appears: No entropy in dimension : 0, and the program continues. However, calling forest.predict(X_test, ...) raises an error:

line 78, in predict:
self.tree[i].compute_split(pts, indices, scores[i])

IndexError: list index out of range

Below is code that reproduces the problem. Can anyone help me solve it? Thanks a lot!

import numpy as np
import scipy.io as sio
from forest import Forest
from pyod.utils.data import evaluate_print
from sklearn.model_selection import train_test_split

datasets = ['ionosphere']

L = len(datasets)
trials = 10

for i in range(0, L):

    data = sio.loadmat('../data/'+datasets[i]+'.mat')
    X, y = data["X"], data["y"]
    inlier_X, inlier_y = X[y.reshape(-1) == 0, :], y[y.reshape(-1) == 0, :]
    outlier_X, outlier_y = X[y.reshape(-1) == 1, :], y[y.reshape(-1) == 1, :]    

    for j in range(0, trials):

        np.random.seed(j)        

        X_train, tmp_X, y_train, tmp_y = train_test_split(inlier_X, inlier_y, test_size=0.4, random_state=j)
        X_test, y_test = np.vstack((tmp_X, outlier_X)), np.vstack((tmp_y, outlier_y))        

        n_samples = 100
        kwargs = {'max_depth': 10,
                  'n_trees': 50,
                  'max_samples': n_samples,
                  'max_buckets': 3,
                  'epsilon': 0.1,
                  'sample_axis': 1,
                  'threshold': 0}

        forest = Forest(**kwargs)

        forest.fit(np.transpose(X_train))
        indices, outliers, scores, pst, our_scores = forest.predict(np.transpose(X_test),
                                                                    err=0.1,
                                                                    pct=50)
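For anyone else hitting this, the "No entropy" dimension can be located before fitting by checking for zero-variance columns in the training split. A minimal sketch with synthetic data standing in for the real training split (the ionosphere matrix itself is not loaded here):

```python
import numpy as np

# synthetic stand-in for a training split where dimension 0 is constant
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_train[:, 0] = 1.0  # force a zero-variance dimension

# dimensions with zero variance carry no entropy for the splitter
const_dims = np.flatnonzero(X_train.std(axis=0) == 0)
print(const_dims)  # → [0]
```

Note that PIDForest subsamples max_samples points per tree, so a dimension can also end up constant within a subsample even when it varies globally.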
xuyxu commented 4 years ago

It looks like this problem happens when all training samples take the same value in some dimension. What is the suggested way to handle such datasets? Is simply removing the dimensions with constant values appropriate?
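If removing constant dimensions is acceptable, one sketch of that workaround (drop_constant_dims is a hypothetical helper, not part of PIDForest): compute the mask of informative dimensions on X_train only, then apply the same mask to X_test so the two splits stay aligned.

```python
import numpy as np

def drop_constant_dims(X_train, X_test):
    """Drop dimensions that are constant in the training data.

    The mask is computed from X_train alone and applied to X_test,
    so both splits keep the same columns.
    """
    keep = X_train.std(axis=0) > 0  # boolean mask of non-constant dimensions
    return X_train[:, keep], X_test[:, keep]

# toy example: dimension 0 is constant in training
X_train = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
X_test = np.array([[1.0, 5.0], [2.0, 6.0]])
X_train2, X_test2 = drop_constant_dims(X_train, X_test)
print(X_train2.shape, X_test2.shape)  # → (3, 1) (2, 1)
```

This would be applied before the np.transpose calls, since it assumes the samples-by-dimensions layout used when the data is loaded.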