taulanti / randomforest-matlab

Automatically exported from code.google.com/p/randomforest-matlab
0 stars 0 forks source link

Non-Efficient Memory Allocation for RF Storage #30

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
This is a great MATLAB transportation of Random Forest. Thank you very much.

However, I found the memory allocated to store the RF (for classification) 
model is very non-efficient. Specifically, there are a lot of non-necessary 
elements stored in treemap, nodestatus, nodeclass, bestvar, xbestsplit, and 
ndbigtree. 

I will take twonorm.mat as an example for illustrain purpose, where only one 
tree is trained for simplicity. 
    >> load ./data/twonorm.mat
    >> model = classRF_train(inputs', outputs, 1);
Dimension of variables in the model are listed below (on my computer with 
64-bit Windows and MATLAB 2011b):

        Value
nrnodes     601
treemap     <601*2 int32>
nodestatus  <601*1 int32>
nodeclass   <601*1 int32>
bestvar     <601*2 int32>
xbestsplit  <601*2 int32>
ndbigtree   <601*2 int32>

Actually, ndbigtree denotes the number of nodes in each tree, in which only the 
first #model.ntree (here only 1) elements are useful and the rest are all zero. 
In my output, ndbigtree(1) is 

63. I checked on the predictClassTree() function in classTree.cpp to see how 
the prediction is made based on the tree hierarchy. I found that only the first 
#model.ntree (here only 1) elements in nodestatus, nodeclass, bestvar and 
xbestsplit are useful. The index of the left and right child of the kth node is 
treemap[2*k]-1 and treemap[2*k+1]-1, respectively. 

I am not sure why there are only 63 nodes in the tree, but we have to store as 
much as 601 (#nrnodes) elements in say nodestatus, and 2*63 elements in 
treemap. It will cost a great deal of extra memory to store RF with many trees 
trained on large number of samples. Is that possible to improve the momery 
allocation?

Original issue reported on code.google.com by jianghua...@gmail.com on 7 Apr 2012 at 10:42

GoogleCodeExporter commented 9 years ago
hi

thanks for the detailed bug

yeh, there is inefficient storage going on. i intend to fix ndbigtree in 
sometime

the other variables are tightly wound up in the code. the total storage 
required is nrnodes (rows) x ntree (cols), and its stored in a row first order 
which causes each tree to have atleast nrnodes (ntree is fixed but number of 
nodes in tree are atmost nrnodes). if it was a col first order than the nrnodes 
and the storage size could be reallocated (increased) at run-time. i haven't 
written the original code (its from randomforest-r) and i dont want to mess 
anything with it so i  haven't made any modifications to it.

Original comment by abhirana on 10 Apr 2012 at 1:04