h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Review Ability to Get a Proximity Matrix from DRF #9351

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Review the computational efficiency of returning a proximity matrix from H2O-3's Distributed Random Forest (DRF), and consider adding something similar to the proximity matrix that R's [Random Forest implementation|https://cran.r-project.org/web/packages/randomForest/randomForest.pdf] returns.

Note: The size of the matrix can be a limiting factor. In some cases it may be impossible to calculate the full matrix; one solution could be to keep only the N most similar rows. Breiman also ran into computational issues with the proximity matrix in his Random Forest implementation:


Proximities
These are one of the most useful tools in random forests. The proximities originally formed a NxN matrix. After a tree is grown, put all of the data, both training and oob, down the tree. If cases k and n are in the same terminal node increase their proximity by one. At the end, normalize the proximities by dividing by the number of trees.

Users noted that with large data sets, they could not fit an NxN matrix into fast memory. A modification reduced the required memory size to NxT where T is the number of trees in the forest. To speed up the computation-intensive scaling and iterative missing value replacement, the user is given the option of retaining only the nrnn largest proximities to each case.

When a test set is present, the proximities of each case in the test set with each case in the training set can also be computed. The amount of additional computing is moderate.

More information on this is available [here|https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox].
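
The scheme quoted above can be sketched without ever materializing the N x N matrix. This is a minimal plain-Python sketch, not H2O-3 code. It assumes an N x T table of terminal-node IDs is already available (in H2O-3 this might come from `predict_leaf_node_assignment`, but that wiring is an assumption). The names `top_k_proximities`, `leaf_ids`, and `k` are hypothetical, with `k` playing the role of nrnn.

```python
from collections import defaultdict


def top_k_proximities(leaf_ids, k):
    """Return, for each row, up to k (other_row, proximity) pairs.

    leaf_ids[i][t] is the terminal node that row i reaches in tree t,
    i.e. the N x T representation. Pairs are ordered by decreasing
    proximity, with ties broken by the smaller row index.
    """
    n_rows = len(leaf_ids)
    n_trees = len(leaf_ids[0]) if n_rows else 0

    # Group rows by (tree, leaf) so only rows sharing a leaf are paired.
    buckets = defaultdict(list)
    for row, leaves in enumerate(leaf_ids):
        for tree, leaf in enumerate(leaves):
            buckets[(tree, leaf)].append(row)

    result = []
    for row, leaves in enumerate(leaf_ids):
        # Count the trees in which `row` shares a terminal node with each other row.
        shared = defaultdict(int)
        for tree, leaf in enumerate(leaves):
            for other in buckets[(tree, leaf)]:
                if other != row:
                    shared[other] += 1
        # Normalize by the number of trees and keep only the k largest.
        ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
        result.append([(other, count / n_trees) for other, count in ranked[:k]])
    return result
```

Memory for the result is bounded by N x k entries rather than N x N. The running time still depends on how many rows share each leaf, so very large terminal nodes remain expensive in this sketch.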

Possible use case: using the proximity matrix to evaluate distance between records
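
For this use case, one common convention is to treat 1 minus the proximity as a dissimilarity between two records. The sketch below computes it on demand for a single pair, so no matrix needs to be stored. It is illustrative only: the function names are hypothetical, and the inputs are assumed to be the per-tree terminal-node IDs of the two records.

```python
def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two records land in the same terminal node."""
    if len(leaves_a) != len(leaves_b):
        raise ValueError("both records must be scored by the same trees")
    if not leaves_a:
        raise ValueError("at least one tree is required")
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)


def rf_distance(leaves_a, leaves_b):
    """Dissimilarity in [0, 1]: 0 when the records always share a leaf."""
    return 1.0 - proximity(leaves_a, leaves_b)
```

Computing pairs lazily like this trades repeated work for memory, which may suit cases where only a few record pairs are ever compared.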
exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: #93670 (https://support.h2o.ai/a/tickets/93670) - Re: h2o DRF question

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6270
Assignee: New H2O Bugs
Reporter: Lauren DiPerna
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A