h2oai / h2o4gpu

H2Oai GPU Edition
Apache License 2.0

Use Intel DAAL for CPU side of many algorithms #33

Open pseudotensor opened 7 years ago

pseudotensor commented 7 years ago

https://software.intel.com/en-us/intel-daal/details

Algorithms

Data Analysis: Characterization, Summarization, and Transformation

Low Order Moments Computes the basic dataset characteristics such as sums, means, second order raw moments, variances, standard deviations, etc.
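For reference, the low-order moments listed above are easy to sketch in plain NumPy (this illustrates the quantities themselves, not DAAL's API):

```python
import numpy as np

# Toy dataset: 4 observations, 3 features
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])

sums = X.sum(axis=0)
means = X.mean(axis=0)
raw_moments2 = (X ** 2).mean(axis=0)  # second-order raw moments
variances = X.var(axis=0, ddof=1)     # sample variances
std_devs = X.std(axis=0, ddof=1)      # sample standard deviations
```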

Quantile Computes quantiles that summarize the distribution of data across equal-sized groups as defined by quantile orders.
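As a quick illustration of quantile orders (again NumPy, not DAAL's API):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
# Quartiles: the quantile orders 0.25, 0.5, 0.75 split the data
# into four equal-sized groups
q = np.quantile(x, [0.25, 0.5, 0.75])
```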

Correlation and Variance-Covariance Matrices Quantifies pairwise statistical relationship between feature vectors.
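The same two matrices, sketched with NumPy for context:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 observations, 3 features

cov = np.cov(X, rowvar=False)        # 3x3 variance-covariance matrix
corr = np.corrcoef(X, rowvar=False)  # 3x3 correlation matrix
```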

Cosine Distance Matrix Measures pairwise similarity between feature vectors using cosine distances.
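A pairwise cosine distance matrix amounts to normalizing rows and taking one minus the dot products; a minimal NumPy sketch:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Normalize each feature vector, then 1 - cosine similarity
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
cos_dist = 1.0 - Xn @ Xn.T
```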

Correlation Distance Matrix Measures pairwise similarity between feature vectors using correlation distances.

Cholesky Decomposition Decomposes a symmetric positive-definite matrix into a product of a lower triangular matrix and its transpose. This decomposition is a basic operation used in solving linear systems, non-linear optimization, Kalman filtration, etc.
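The linear-systems use mentioned here is worth making concrete — factor once, then two triangular solves (NumPy sketch):

```python
import numpy as np

# Symmetric positive-definite matrix
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

L = np.linalg.cholesky(A)    # lower triangular, A = L @ L.T
y = np.linalg.solve(L, b)    # forward substitution
x = np.linalg.solve(L.T, y)  # back substitution: x solves A x = b
```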

QR Decomposition Decomposes a general matrix into a product of an orthogonal matrix and an upper triangular matrix. This decomposition is used in solving linear inverse and least squares problems. It is also a fundamental operation in finding eigenvalues and eigenvectors.
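And the least-squares use of QR, for comparison (NumPy sketch):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

Q, R = np.linalg.qr(A)           # A = Q R, Q orthogonal, R upper triangular
# Least-squares solution of A x ~ b via R x = Q^T b
x = np.linalg.solve(R, Q.T @ b)
```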

SVD Singular Value Decomposition decomposes a matrix into a product of left singular vectors, singular values, and right singular vectors. It is the basis of Principal Component Analysis, solving linear inverse problems, and data fitting.
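The factorization in one line of NumPy, for reference:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# A = U diag(s) Vt; singular values s are sorted descending
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rec = U @ np.diag(s) @ Vt  # exact reconstruction from the factors
```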

PCA Principal Component Analysis reduces the dimensionality of data by transforming input feature vectors into a new set of principal components orthogonal to each other.
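PCA follows directly from the SVD of centered data; a minimal NumPy sketch (not DAAL's covariance/SVD method selection):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

Xc = X - X.mean(axis=0)                  # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                      # top-2 principal directions (orthonormal)
scores = Xc @ components.T               # data projected onto 2 components
```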

K-Means Partitions a dataset into clusters of similar data points. Each cluster is represented by a centroid, which is the mean of all data points in the cluster.

Expectation-Maximization Finds the maximum-likelihood estimate of the parameters in models. It is used for the Gaussian Mixture Model as a clustering method. It can also be used in non-linear dimensionality reduction, missing value problems, etc.

Outlier Detection Identifies observations that are abnormally distant from other observations. An entire feature vector (multivariate) or a single feature value (univariate) can be considered in determining whether the corresponding observation is an outlier.

Association Rules Discovers relationships between variables with a certain level of confidence.

Linear and Radial Basis Function Kernel Functions Map data onto higher-dimensional space.
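Both kernels reduce to simple matrix expressions; a NumPy sketch (gamma here is an arbitrary illustrative value):

```python
import numpy as np

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])

K_lin = X @ X.T                               # linear kernel: pairwise dot products
gamma = 0.5                                   # illustrative RBF width parameter
sq = ((X[:, None] - X[None]) ** 2).sum(axis=-1)
K_rbf = np.exp(-gamma * sq)                   # RBF kernel: exp(-gamma * ||xi - xj||^2)
```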

Quality Metrics Compute a set of numeric values to characterize quantitative properties of the results returned by analytical algorithms. These metrics include Confusion Matrix, Accuracy, Precision, Recall, F-score, etc.
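The listed metrics all derive from the confusion-matrix counts; a small worked example in plain Python/NumPy:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

# Confusion-matrix cells for the positive class
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
```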

Machine Learning: Regression, Classification, and More

Neural Networks for Deep Learning A programming paradigm which enables a computer to learn from observational data.

Linear and Ridge Regressions Models the relationship between dependent variables and one or more explanatory variables by fitting linear equations to observed data.
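A toy comparison of the two in NumPy (ridge shown via the normal equations, with the penalty applied to all coefficients including the intercept for brevity):

```python
import numpy as np

# Data that is exactly y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
A = np.column_stack([x, np.ones_like(x)])  # design matrix [x, 1]

# Ordinary least squares
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Ridge: (A^T A + lam I) w = A^T y, shrinking coefficients slightly
lam = 0.1
ridge = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ y)
```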

Naïve Bayes Classifier Splits observations into distinct classes by assigning labels. Naïve Bayes is a probabilistic classifier that assumes independence between features. Often used in text classification and medical diagnosis, it works well even when there is some level of dependence between features.
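The independence assumption makes the classifier tiny to sketch; here is a Gaussian variant in NumPy (DAAL's version is multinomial, so this is only illustrative of the Bayes rule itself):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two classes with well-separated feature means
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Per-class Gaussian parameters, one per feature (independence assumption)
means = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
vars_ = np.array([X[y == c].var(axis=0) for c in (0, 1)])
priors = np.array([(y == c).mean() for c in (0, 1)])

def predict(x):
    # Sum of per-feature Gaussian log-likelihoods, plus log prior
    ll = -0.5 * (np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_).sum(axis=1)
    return int(np.argmax(ll + np.log(priors)))
```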

Boosting Builds a strong classifier from an ensemble of weighted weak classifiers, by iteratively re-weighting according to the accuracy measured for the weak classifiers. A decision stump is provided as a weak classifier. Available boosting algorithms include AdaBoost (a binary classifier), BrownBoost (a binary classifier), and LogitBoost (a multi-class classifier).

SVM Support Vector Machine is a popular binary classifier. It computes a hyperplane that separates observed feature vectors into two classes.

Multi-Class Classifier Builds a multi-class classifier using a binary classifier such as SVM.

ALS Alternating Least Squares is a collaborative filtering method for making predictions about the preferences of a user, based on preference information collected from many users.
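The alternation itself is easy to show on a toy dense matrix (real ALS implementations, DAAL's included, handle sparse/implicit data; this is only the core iteration):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.random((6, 5))        # toy users x items preference matrix
k, lam = 2, 0.1               # latent rank and L2 regularization
U = rng.random((6, k))        # user factors
V = rng.random((5, k))        # item factors
I = np.eye(k)

for _ in range(20):
    # Fix V, solve the regularized least-squares problem for U, then swap
    U = R @ V @ np.linalg.inv(V.T @ V + lam * I)
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * I)

approx = U @ V.T              # low-rank prediction of preferences
```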

pseudotensor commented 6 years ago

https://github.com/daaltces/pydaal-tutorials https://github.com/daaltces/pydaal-tutorials/blob/master/kmeans_example.ipynb

navdeep-G commented 6 years ago

Is the idea here to fall back on this for CPU side eventually by default? Or keep default the same and add this as a potential option?

pseudotensor commented 6 years ago

Currently there is no CPU backend for our kmeans, so it just fails if we take the options in but there is no GPU or n_gpus=0 is chosen. DAAL would provide a CPU backend and is much faster than scikit.
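A hypothetical dispatch sketch of the fallback being discussed (every name here — fit_kmeans, n_gpus, the _fit_* stubs — is illustrative, not the actual h2o4gpu API):

```python
def _fit_gpu(X, k):
    return "gpu"  # stand-in for the existing GPU k-means path

def _fit_cpu(X, k):
    return "cpu"  # stand-in for a DAAL-backed CPU path

def fit_kmeans(X, k, n_gpus=1, gpu_present=True):
    # Use the GPU path only when GPUs are both requested and present;
    # otherwise fall back to the CPU backend instead of failing.
    if n_gpus > 0 and gpu_present:
        return _fit_gpu(X, k)
    return _fit_cpu(X, k)
```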

navdeep-G commented 6 years ago

Okay, makes sense. Seems interesting. Will look further.