Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data

Title: Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data
Authors: Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek
Paper: https://www.dbs.ifi.lmu.de/Publikationen/Papers/pakdd09_SOD.pdf
Tags: [data mining] [outlier detection] [Data Stream]

The paper presents SOD, a model that searches for outliers in high dimensional data using subspaces.
The authors argue that most existing approaches rely on the full-dimensional Euclidean data space (i.e. all features of the data set) to find outliers.
With SOD, the authors explore the axis-parallel subspace spanned by the neighbors of each object in the data set and determine how much the object deviates from those neighbors in this subspace. The object is projected onto that subspace, and its deviation from the neighboring objects determines its outlierness.
The proposed method is particularly useful for high dimensional data where outliers cannot be found in the entire feature space but in different subspaces of the original space.
The search for outliers must be coupled with the search for the relevant subspaces.
SOD implicitly provides not only a quantitative outlier model but also a qualitative one, by specifying for each outlier the features that are relevant to its outlierness. Thus, in contrast to most existing approaches, the SOD model also explains why a point p is an outlier.
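The scoring step can be sketched as follows. This is a minimal illustration of my reading of the paper's definitions, not the authors' or ELKI's implementation: an attribute counts as relevant when the reference set's variance along it is below α times the average per-attribute variance, and the score is the point's distance to the reference-set mean in the relevant attributes, divided by the number of relevant attributes. The function name `sod_score` and the choice to return 0 when no attribute is relevant are assumptions made for this example.

```python
def sod_score(point, reference, alpha=0.8):
    """Return (score, relevant_attributes) for `point` given its reference set.

    Hypothetical helper for illustration; `reference` is a list of points
    with the same number of attributes as `point`.
    """
    d = len(point)
    n = len(reference)
    mean = [sum(r[i] for r in reference) / n for i in range(d)]
    # Per-attribute variance of the reference set around its mean.
    var = [sum((r[i] - mean[i]) ** 2 for r in reference) / n for i in range(d)]
    # Variance each attribute would have if the total were spread evenly.
    expected = sum(var) / d
    # Relevant attributes: those along which the reference set varies little.
    relevant = [i for i in range(d) if var[i] < alpha * expected]
    if not relevant:
        # Assumption: no relevant attribute means no subspace deviation.
        return 0.0, relevant
    # Distance from the point to the reference mean, in relevant attributes only.
    deviation = sum((point[i] - mean[i]) ** 2 for i in relevant) ** 0.5
    return deviation / len(relevant), relevant


# The reference set spreads widely along attribute 0 but is tight along attribute 1.
reference = [(0.0, 5.0), (10.0, 5.1), (20.0, 4.9), (30.0, 5.0)]
score, relevant = sod_score((15.0, 9.0), reference)
print(score, relevant)
```

The returned list of relevant attributes is what gives the qualitative explanation: here only attribute 1 is flagged, so the point is reported as deviating in that attribute even though it sits at the center of the reference set along attribute 0.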
The SOD algorithm relies on two input parameters and a constant α, set to 0.8 based on experiments.
1. k, which specifies the number of nearest neighbors considered when computing the shared nearest neighbor similarity.
2. l, which specifies the size of the reference sets.
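How k and l interact can be sketched like this: k controls the neighbor lists whose overlap defines the shared nearest neighbor (SNN) similarity, and l controls how many of the most similar points form each reference set. This is a brute-force sketch under my own assumptions, not the paper's code: the function names are hypothetical, a point is excluded from its own neighbor list and reference set, and ties are broken by index order.

```python
from math import dist


def knn(points, k):
    """For each point, the set of indices of its k nearest neighbors (self excluded)."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        result.append(set(others[:k]))
    return result


def reference_sets(points, k, l):
    """For each point, the l indices with the highest SNN similarity to it."""
    neighbors = knn(points, k)
    refs = []
    for i in range(len(points)):
        others = [j for j in range(len(points)) if j != i]
        # SNN similarity: how many k-nearest neighbors the two points share.
        others.sort(key=lambda j: len(neighbors[i] & neighbors[j]), reverse=True)
        refs.append(others[:l])
    return refs


# Two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
refs = reference_sets(points, k=3, l=3)
print(refs[0])
```

With this toy data, each point's reference set is drawn from its own group, because points in different groups share no nearest neighbors. The pairwise loop is quadratic in the number of points, so it is only meant for small examples.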
The authors demonstrated that SOD performed better than LOF or ABOD in experiments conducted using the ELKI framework.

My assessment: The SOD algorithm surfaces the same issue of arbitrarily selected constants k and l. Finding optimal values experimentally is hard and time-consuming.
