This project contains our implementation of DISC
, a method for discovering and explaining components in binary datasets. For more details see S. Dalleiger and J. Vreeken, Explainable Data Decompositions, AAAI (2020)
Our goal is to partition a dataset into components, but unlike regular clustering algorithms we additionally want to describe why there are components, how they are different from each other, and what properties are shared among components, in terms of informative patterns observed in the dataset. Our notion of components is therefore different from the typical definition of clusters, which are often defined as a high-similarity or high-density regions in the dataset. Instead, we consider components as regions in the data that show significantly different pattern distributions.
We define the problem in terms of a regularized maximum likelihood, in which we use the Maximum Entropy principle to model each data component with a set of patterns. As the search space is large and unstructured, we propose the deterministic DISC algorithm to efficiently discover high-quality decompositions via an alternating optimization approach. Empirical evaluation on synthetic and real-world data shows that DISC efficiently discovers meaningful components and accurately characterises these in easily understandable terms.
At the moment of this writing, the Parallel STL of g++ (>=10.1.1) requires TBB and is incompatible with TBB 2021.2
On Debian or Ubuntu you can obtain these for example using
apt install libboost-dev g++ cmake libtbb2 libtbb-dev
On Fedora you can obtain these for example using
dnf install boost-devel g++ cmake tbb-devel
On MacOS you can obtain these for example using Homebrew and
brew install boost gcc libomp cmake tbb
For higher precision floating points we can optionally make use of the non-standard 128 bit float type, however, for this we require GNU Quadmath, Boost.Multiprecision and g++.
If the dependencies above are met, you can simply install the python version using
pip install git+https://github.com/sdall/disc-aaai20.git
and simply import the algorithms using
from disc import disc, desc
These functions either expect a binary numpy matrix or a sparse representation of the binary data matrix, that is $B_{ij} = 1$ implies that in row $i$ we can find the value $j$. For example $B = \begin{matrix} 0 & 1 \ 1 & 1 \end{matrix}$ implies
data = [[0], [0, 1]]
For this given, we can now discover an informative pattern set
desc(data)
or we can use a numpy matrix directly, e.g.
import numpy
desc(numpy.random.uniform(size=(10,5)) > 0.5)
It returns a python dictionary with information such as the summary, (initial) objective function and frequencies of singletons and patterns. If we have multiple datasets given, i.e. classes, DESC can describe these in terms of characteristic and shared patterns
class_labels = [0, 1]
desc(data, class_labels)
Additionally, we are now provided with the assignment matrix that assigns a pattern to component (class) if that pattern is informative for that class.
However, if we have no class labels and if we are interested in the parts of the data that exhibit a significantly different distribution from the rest, we can make use of DISC to jointly discover and describe these, i.e.
disc(data)
Again, we are provided with the assignment matrix, but in addition, we also have access to the estimated class labels.
For a complete example see the notebook jupyter/disc-iris.ipynb
.
The R interface has not received major updates and is unmaintained at the moment
If all requirements are met, starting from the root-directory of this project you can compile and import the Rcpp based bindings by using
install.packages('Rcpp')
require(Rcpp)
inc = paste('-std=c++17 -DWITH_EXECUTION_POLICIES=1 ', '-I ', getwd(), '/include', ' -I ', getwd(), '/src', sep='')
Sys.setenv("PKG_CXXFLAGS"=inc)
Sys.setenv("PKG_LIBS"="-ltbb")
sourceCpp("src/bindings/R/RDisc.cpp")
Similar to the python interface you now have access to desc and disc
data <- list(c(1, 2, 3), c(2, 3, 4))
labels <- list(0, 1)
desc(data)
desc(data, labels)
disc(data)