vc1492a / PyNomaly

Anomaly detection using LoOP: Local Outlier Probabilities, a local density based outlier detection method providing an outlier score in the range of [0,1].

Changes to distance measure implementation to improve speed #38

Closed nghiadanh26 closed 3 years ago

nghiadanh26 commented 4 years ago

Hello authors, I have worked with a range of anomaly detection algorithms. I have been using LoOP for my testbed, and the time consumed is very high. More specifically, my training data includes about 46000 points with two features each, and the number of clusters is 1 (only normal traffic with label 0). It took about 11000 seconds for the training phase and 0.5 s for the testing phase (with k=20).

When I reviewed the code in loop.py, I saw that you use two loops in the function:

    def _distances(self, progress_bar: bool = False) -> None:

I have rewritten this function, ignoring the cluster (because I do not need it) and using numpy functions to reduce it to only one 'for' loop. It now takes only about 120 s instead. The function looks like this:

    import time
    import numpy as np

    def distance_dn(self, point_vector):
        """
        Calculate distances from each point to the remaining points
        """
        data = point_vector
        k = self.n_neighbors
        distance = np.zeros((len(data), k))
        index = np.zeros((len(data), k))
        t1 = time.time()
        for i in range(len(data)):
            # Tile point i so it can be subtracted from every row of data
            data_i = np.array([[data[i][0], data[i][1]]])
            point_arr = np.repeat(data_i, len(data), axis=0)
            # Euclidean distance from point i to all points
            diff = (point_arr - data)**2
            dis = (np.sum(diff, axis=1))**0.5
            # Keep the k smallest distances and their indices
            index[i] = (np.argpartition(dis, k))[0:k]
            distance[i] = dis[(np.argpartition(dis, k))[0:k]]
        print(distance)
        t2 = time.time()
        #print(t2-t1)
        return distance, index

I hope this can be my contribution to LoOP.
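For readers following the thread, here is a minimal sketch of the same single-loop idea. It is not the code from the issue or from PyNomaly: `knn_distances` is a hypothetical standalone helper. It differs from the snippet above in three assumed ways: broadcasting replaces `np.repeat`, `np.argpartition` is called once per row, and each point is excluded from its own neighbour list (as posted, the snippet appears to include the point itself, since its distance to itself is 0).

```python
import numpy as np


def knn_distances(data, k):
    """Return (distances, indices) of the k nearest neighbours of every row.

    Hypothetical helper, not part of PyNomaly. Each point is excluded from
    its own neighbour list, and neighbours are sorted by increasing distance.
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")

    distances = np.empty((n, k))
    indices = np.empty((n, k), dtype=int)
    for i in range(n):
        # Broadcasting subtracts row i from every row; no np.repeat needed.
        dist = np.sqrt(((data - data[i]) ** 2).sum(axis=1))
        dist[i] = np.inf  # exclude the point itself
        # argpartition places the k smallest distances in the first k slots.
        nearest = np.argpartition(dist, k - 1)[:k]
        # Order those k neighbours from closest to farthest.
        nearest = nearest[np.argsort(dist[nearest])]
        indices[i] = nearest
        distances[i] = dist[nearest]
    return distances, indices


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.normal(size=(200, 2))
    distances, indices = knn_distances(points, k=20)
    print(distances.shape, indices.shape)
```

The loop still performs one pass per point, so the cost remains quadratic in the number of points; the gain over a nested Python loop comes from doing the inner pass in NumPy.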

vc1492a commented 4 years ago

Hi @nghiadanh26, thanks for opening this issue! Could you please open a pull request with your updates, following the contribution guidelines outlined in readme.md? That way, we can test your updates to ensure they also work as intended when using numba, which dramatically accelerates our current implementation (and may be an option for you in addition to your changes above).

Once you open the PR, I'll be in a better position to review your changes and perhaps integrate your contributions, thanks!

nghiadanh26 commented 4 years ago

Hi @vc1492a, my code is only for my own work, so I'm not sure it can pass the tests. I just commented out part of your code and replaced it with new code that ignores many things, such as clusters and numba. Therefore, I think it may not be correct if the dataset contains multiple clusters. If you want, I can still make a pull request even though it will not pass the tests.
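To illustrate the cluster concern raised here, the sketch below restricts each point's neighbour search to points sharing its cluster label. This is an assumption about the intended behaviour, not PyNomaly's actual implementation; `knn_distances_by_cluster` and its parameters are hypothetical names, and the sketch assumes every cluster holds more than k points.

```python
import numpy as np


def knn_distances_by_cluster(data, cluster_labels, k):
    """k nearest neighbours of each point, searched only within its cluster.

    Hypothetical helper, not PyNomaly's implementation. Assumes every
    cluster holds more than k points.
    """
    data = np.asarray(data, dtype=float)
    labels = np.asarray(cluster_labels)
    n = data.shape[0]
    distances = np.empty((n, k))
    indices = np.empty((n, k), dtype=int)

    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)  # row numbers in this cluster
        if members.size <= k:
            raise ValueError(f"cluster {label!r} needs more than {k} points")
        subset = data[members]
        for pos, row in enumerate(members):
            dist = np.sqrt(((subset - subset[pos]) ** 2).sum(axis=1))
            dist[pos] = np.inf  # exclude the point itself
            nearest = np.argpartition(dist, k - 1)[:k]
            nearest = nearest[np.argsort(dist[nearest])]
            distances[row] = dist[nearest]
            indices[row] = members[nearest]  # map back to original row numbers
    return distances, indices


if __name__ == "__main__":
    pts = np.array([[0.0, 0.0], [10.0, 0.0], [1.0, 0.0],
                    [11.0, 0.0], [3.0, 0.0], [14.0, 0.0]])
    labels = np.array([0, 1, 0, 1, 0, 1])
    d, idx = knn_distances_by_cluster(pts, labels, k=1)
    print(idx.ravel(), d.ravel())
```

With a single cluster label, as in the 46000-point dataset described earlier in the thread, this reduces to a plain nearest-neighbour search over all points.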

vc1492a commented 4 years ago

Thanks! If you could open that PR, that would be great! I know it may not pass tests but that would be the most organized way to get started on reviewing your proposed changes.

nghiadanh26 commented 4 years ago

I have made a pull request. Unsurprisingly, the tests failed as predicted. You can now check my code, and it would be great if you could give me feedback.

vc1492a commented 4 years ago

Thanks @nghiadanh26, will find some time to take a look in the next several weeks. Thanks for opening the PR!

vc1492a commented 3 years ago

Closing this issue and the associated pull request due to lack of activity. See the discussion in #39.