tingliu / randomforest-matlab

Automatically exported from code.google.com/p/randomforest-matlab

about the treemap #44

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Hi, everyone.
When I run random forest with, say, ntree set to 500, model.treemap is a 
matrix of size 501 x 1000, and most of its elements are zeros. 
What does this treemap mean? 
Thank you.

Original issue reported on code.google.com by zhangleu...@gmail.com on 26 Sep 2012 at 2:34

GoogleCodeExporter commented 8 years ago
Sorry, I should add that I am using the package in MATLAB.

Original comment by zhangleu...@gmail.com on 26 Sep 2012 at 2:36

GoogleCodeExporter commented 8 years ago
hi zhang

treemap has the left and right node information for the trees in the forest. 
the variable is used for navigating the tree 

in this code i used treemap to plot the tree
code: 
http://code.google.com/p/randomforest-matlab/issues/attachmentText?id=18&aid=180001000&name=tutorial_plot_tree.m

relevant information on the treeplotting: 
http://code.google.com/p/randomforest-matlab/issues/detail?id=18&can=1

treemap = stores the tree structure. in the regression code you have two 
variables, ldau and rdau, that treemap consists of.
nodestatus = stores whether individual nodes are internal or leaf nodes
nodeclass = the class of the leaf nodes
bestvar = variable that splits the node
xbestsplit = value of the variable that splits the node (> goes to the right 
side, else the left side)

the above variables are all NEEDED for prediction. 

Original comment by abhirana on 26 Sep 2012 at 5:59
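
The navigation described in the comment above can be sketched roughly as follows. This is only an illustration on a hand-built toy tree, not output from classRF_train: the array values are made up, and isLeaf is a hypothetical stand-in for nodestatus, whose actual encoding is not stated in this thread.

```matlab
% Toy tree, hand-built for illustration (NOT produced by classRF_train).
% Node 1 splits on variable 2 at 0.5; nodes 2 and 3 are leaves.
lDau       = [2 0 0];            % left child index per node (0 = no child)
rDau       = [3 0 0];            % right child index per node (0 = no child)
isLeaf     = [false true true];  % hypothetical stand-in for nodestatus
nodeclass  = [0 1 2];            % class label, meaningful for leaves only
bestvar    = [2 0 0];            % splitting variable per internal node
xbestsplit = [0.5 0 0];          % split value per internal node

x = [0.1 0.9];                   % one example with two features
k = 1;                           % start at the root
while ~isLeaf(k)
    if x(bestvar(k)) > xbestsplit(k)
        k = rDau(k);             % "> goes to the right side"
    else
        k = lDau(k);             % "else the left side"
    end
end
predicted = nodeclass(k);        % class stored at the leaf that was reached
```

Here x(2) is 0.9, which is greater than 0.5, so the example goes right and ends in node 3.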

GoogleCodeExporter commented 8 years ago
Thank you for your reply.
When I checked the values of the treemap further, I found that 
model.treemap(:,tree_num*2) is always all zeros. What do these zeros stand 
for?  

Original comment by zhangleu...@gmail.com on 26 Sep 2012 at 7:10

GoogleCodeExporter commented 8 years ago
zeros mean that the nodes do not have a daughter. the values map node indices 
to child nodes, so for a node X, look up model.treemap(X,tree_num*2) to find 
its right child node

btw, i think your condition will happen only if the trees are one-sided, i.e. 
only growing left or only growing right.

Original comment by abhirana on 26 Sep 2012 at 7:39

GoogleCodeExporter commented 8 years ago
Thanks. I tried several datasets, but the entries model.treemap(X,tree_num*2) 
are all zeros, and I am quite puzzled by this result.

Original comment by zhangleu...@gmail.com on 26 Sep 2012 at 8:25

GoogleCodeExporter commented 8 years ago
can you send me the model file if possible?

Original comment by abhirana on 26 Sep 2012 at 2:55

GoogleCodeExporter commented 8 years ago
i know the issue

treemap = [model.treemap(:,tree_num*2-1); model.treemap(:,tree_num*2);];
lDau = treemap(1:2:end);  lDau = lDau(1:num_nodes);
rDau = treemap(2:2:end);  rDau = rDau(1:num_nodes);

the two columns of treemap are concatenated, and lDau and rDau are read from 
the result. the lDau and rDau entries alternate (they are interleaved).

most trees do not use all nrnodes (the max size), and that is the reason why 
the second column is empty most of the time

Original comment by abhirana on 26 Sep 2012 at 3:01
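
To see why the second column can be all zeros, here is a small sketch that applies the three lines from the comment above to a made-up treemap for a single tree. The toy values, nrnodes = 4 and num_nodes = 3 are assumptions chosen for illustration; they do not come from a trained model.

```matlab
% Toy treemap for ONE tree with nrnodes = 4 (made-up values, not from a model).
% The tree has 3 nodes: node 1 has children 2 and 3, nodes 2 and 3 are leaves.
% Interleaved storage is [l1 r1 l2 r2 l3 r3 ...] = [2 3 0 0 0 0 0 0],
% split into two columns of length nrnodes.
toy_treemap = [2 0;
               3 0;
               0 0;
               0 0];
tree_num  = 1;
num_nodes = 3;

% Same steps as in the comment above, applied to the toy matrix.
treemap = [toy_treemap(:,tree_num*2-1); toy_treemap(:,tree_num*2)];
lDau = treemap(1:2:end);  lDau = lDau(1:num_nodes);
rDau = treemap(2:2:end);  rDau = rDau(1:num_nodes);

% The tree uses fewer than nrnodes nodes, so the second column stays zero.
second_column_empty = all(toy_treemap(:,tree_num*2) == 0);
```

Because the left and right entries alternate in storage, a tree with few nodes fits entirely in the first column even though it has right children.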

GoogleCodeExporter commented 8 years ago
Oh, thanks for your reply.
I think I have a better understanding of the treemap now.
I have another question: when we use the model to predict on the test data,
is there a way to find which node in each tree a test example ends up in?
Thanks.

Original comment by zhangleu...@gmail.com on 27 Sep 2012 at 1:39

GoogleCodeExporter commented 8 years ago
i guess you are looking for node information

http://code.google.com/p/randomforest-matlab/source/browse/trunk/RF_Class_C/tutorial_ClassRF.m#249

http://code.google.com/p/randomforest-matlab/source/browse/trunk/RF_Class_C/classRF_predict.m#26

Original comment by abhirana on 27 Sep 2012 at 1:44

GoogleCodeExporter commented 8 years ago
Hi, when I went through the details, I found that the node output is not an 
ntest by ntree matrix. In my example, if the number of test examples is 50, 
the node matrix is 50 by 1.

Original comment by zhangleu...@gmail.com on 27 Sep 2012 at 8:14

GoogleCodeExporter commented 8 years ago
When I try the following code
model = classRF_train(X_trn,Y_trn);
clear test_options
test_options.predict_all = 1;
test_options.proximity = 1;
[Y_hat, votes, prediction_per_tree, proximity_ts] = classRF_predict(X_tst,model,test_options);
there is an error which says
??? Error using ==> classRF_predict
Too many output arguments.

Original comment by zhangleu...@gmail.com on 27 Sep 2012 at 8:26

GoogleCodeExporter commented 8 years ago
hi zhang

are you using the latest svn source? if not, sync to the svn source, or use 
this download link:
http://randomforest-matlab.googlecode.com/issues/attachment?aid=410008000&name=rf-rev55+-+20+Sep+2012.zip&token=DiBZ0BWzfgmEWFULfd4MDOaKvTo%3A1348784889462
(i uploaded it in a different issue: 
http://code.google.com/p/randomforest-matlab/issues/detail?id=41&can=1)

Original comment by abhirana on 27 Sep 2012 at 10:30

GoogleCodeExporter commented 8 years ago
Ok. Thank you for your advice. I have updated the package now.
I still have a question. At each node, RF finds the best split among the 
randomly selected features, but in the package this is like a black box. 
Is there a way for me to modify the splitting rule in the RF?

Original comment by zhangleu...@gmail.com on 28 Sep 2012 at 3:08

GoogleCodeExporter commented 8 years ago
yeh, RF splits are based on the CART algorithm splits.

nah, the splitting code is so deeply embedded that it might be hard to modify 
the splitting rule in RF. 

take a look at the findbestsplit function (reg_Rf.cpp for regression, rfsub.f 
for classification) and search for crit

Original comment by abhirana on 28 Sep 2012 at 6:23

GoogleCodeExporter commented 8 years ago
Hi, abhirana,
In the file tutorial_Proximity_training_test, how do you calculate the 
proximity between a training sample and a test sample?
Do we need to find the node information for the test and training samples, 
and if they are located in the same node, increase the proximity between them 
by 1?
Then we normalize the proximity matrix.
Is that the way to calculate the proximity between the test and train samples?

Original comment by zhangleu...@gmail.com on 28 Sep 2012 at 7:36

GoogleCodeExporter commented 8 years ago
give me half a day. i need to fix a bug in the computeProximity routine. 

if i remember correctly, proximity is calculated roughly as you described

Original comment by abhirana on 28 Sep 2012 at 7:39

GoogleCodeExporter commented 8 years ago
i just added the bug fix to the computeProximity routine and it's in the svn.

if you dont want to redownload the source, just change
line 245 in RF_Class_C\src\rfutils.cpp  (computeProximity)
from (inbag[i] > 0) ^ (inbag[j] > 0) to (inbag[i] > 0) || (inbag[j] > 0)

i guess you are correct. computeProximity calculates the proximity matrix

Original comment by abhirana on 28 Sep 2012 at 7:50
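
The counting scheme discussed in the comments above (add 1 when two samples fall in the same node of a tree, then normalize) can be sketched as below. This is only an illustration on made-up node indices; it normalizes by the number of trees, which is an assumption, and it ignores the inbag condition that the actual computeProximity routine checks.

```matlab
% Toy terminal-node indices (made-up values): rows are samples, columns are trees.
nodes = [3 5 2;
         3 4 2;
         1 5 7];
[n, ntree] = size(nodes);

prox = zeros(n, n);
for t = 1:ntree
    for i = 1:n
        for j = 1:n
            % Two samples that land in the same node of tree t get +1.
            if nodes(i,t) == nodes(j,t)
                prox(i,j) = prox(i,j) + 1;
            end
        end
    end
end

% Normalize by the number of trees (assumed) so entries lie in [0, 1].
prox = prox / ntree;
```

Samples 1 and 2 share a node in two of the three trees, so their proximity is 2/3; samples 2 and 3 never share a node, so theirs is 0.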

GoogleCodeExporter commented 8 years ago
Ok. But I am not sure about 
 prox:    n x n proximity matrix 
I guess prox should be length(Y_tst) x (length(Y_tst)+length(Y_trn)),
where the first length(Y_tst) x length(Y_tst) block is the proximity 
between the test samples and the rest is the proximity between the train and 
test samples.

Original comment by zhangleu...@gmail.com on 28 Sep 2012 at 8:35

GoogleCodeExporter commented 8 years ago
note that there are two cases described in the tutorial file 
tutorial_proximity_training_test.m

in the first, training is done and the testing is not aware of the training 
examples. the proximity calculation REQUIRES training example information, and 
when that is not available it defaults to the proximity of only the test examples

the second example is what you are looking for:
pass the test examples and labels into classRF_train. the returned model will 
have the proximity information
 model2 = classRF_train(X_trn,Y_trn, 2000, 0, extra_options,X_tst,Y_tst);
 model2.proximity_tst

do post a snippet of code if you still have issues.

Original comment by abhirana on 28 Sep 2012 at 8:46

GoogleCodeExporter commented 8 years ago
Hi,abhirana,
Is there some method for us to find the margin for each tree?

Original comment by zhangleu...@gmail.com on 28 Sep 2012 at 10:32

GoogleCodeExporter commented 8 years ago
can you define margin?

Original comment by abhirana on 28 Sep 2012 at 5:55

GoogleCodeExporter commented 8 years ago
The margin is as defined by Breiman.
I guess we can get it from 'prediction_per_tree'.
Sorry, we can only get it for a collection of trees, not for each tree. 

Original comment by zhangleu...@gmail.com on 29 Sep 2012 at 2:27

GoogleCodeExporter commented 8 years ago
when you use prediction_per_tree you will get an nexample x ntree matrix, so you 
will have the individual tree prediction for each test example

Original comment by abhirana on 29 Sep 2012 at 2:29
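
As a rough sketch of how a margin could be computed from such a matrix: Breiman's margin for an example is the fraction of trees voting for the true class minus the largest fraction voting for any other class. The prediction matrix and labels below are made-up toy values, not output of classRF_predict.

```matlab
% Toy per-tree predictions (made-up): rows are test examples, columns are trees.
prediction_per_tree = [1 1 2 1;
                       2 2 2 1;
                       1 2 2 2];
Y_tst = [1; 2; 1];                        % toy true labels

classes = unique([Y_tst; prediction_per_tree(:)]);
[nexample, ntree] = size(prediction_per_tree);

margin = zeros(nexample, 1);
for i = 1:nexample
    % Fraction of trees voting for each class on example i.
    frac = zeros(numel(classes), 1);
    for c = 1:numel(classes)
        frac(c) = sum(prediction_per_tree(i,:) == classes(c)) / ntree;
    end
    is_true = (classes == Y_tst(i));
    % Votes for the true class minus the best competing class.
    margin(i) = frac(is_true) - max(frac(~is_true));
end
```

A positive margin means the forest votes for the correct class; the third toy example has a margin of -0.5 because most trees vote for the wrong class.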

GoogleCodeExporter commented 8 years ago
Hi, abhirana.
If I set extra_options.replace=0, there are still many zeros in model.inbag, 
which means some samples are still out of bag. Why does this happen?
code:
load data/twonorm

%modify so that training data is NxD and labels are Nx1, where N=#of
%examples, D=# of features

X = inputs';
Y = outputs;

[N D] =size(X);
%randomly split into 250 examples for training and 50 for testing
randvector = randperm(N);

X_trn = X(randvector(1:250),:);
Y_trn = Y(randvector(1:250));
X_tst = X(randvector(251:end),:);
Y_tst = Y(randvector(251:end));
extra_options.replace = 0 ;

extra_options.keep_inbag = 1; %(Default = 0)
model = classRF_train(X_trn,Y_trn, 100, 4, extra_options);

Original comment by zhangleu...@gmail.com on 29 Sep 2012 at 7:09

GoogleCodeExporter commented 8 years ago
replace only changes the sampling scheme from with replacement (default, or 1) 
to without replacement (0). it doesn't have any effect on the number of 
out-of-bag examples because that's controlled by the sampsize variable

if you want to change how many examples are sampled per tree, change the 
sampsize variable

Original comment by abhirana on 29 Sep 2012 at 11:37

GoogleCodeExporter commented 8 years ago
Ok. But what is the default value for sampsize in your code? It does not seem 
to be mentioned in the tutorial file.

Original comment by zhangleu...@gmail.com on 1 Oct 2012 at 7:05

GoogleCodeExporter commented 8 years ago
randomforests default: sampling N times with replacement from N training 
examples (which is the same as what is done for bagging).

Original comment by abhirana on 1 Oct 2012 at 5:25

GoogleCodeExporter commented 8 years ago
So, in this case, if replace=0, why are there so many 0s in inbag?
I guess a 0 in inbag means the sample is out of bag. But if we sample N 
times without replacement, every sample should be in the bag.

Original comment by zhangleu...@gmail.com on 2 Oct 2012 at 2:09

GoogleCodeExporter commented 8 years ago
Note that the sampsize default is .632*N when sampling without replacement. 
That proportion is about the same as the fraction of distinct examples drawn 
when sampling with replacement, so you get about the same number of out-of-bag 
examples both ways

Original comment by abhirana on 2 Oct 2012 at 3:00
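
The arithmetic behind the comment above can be checked with a few lines. When N draws are made with replacement from N examples, the expected fraction of distinct examples in the bag is 1 - (1 - 1/N)^N, which is close to 0.632. N = 250 matches the training set size used earlier in this thread; the use of round() for the without-replacement sample size is an assumption, not taken from the package source.

```matlab
N = 250;   % number of training examples, as in the code posted earlier

% With replacement: each example is missed by one draw with probability
% (1 - 1/N), so it is missed by all N draws with probability (1 - 1/N)^N.
frac_inbag_with_replacement = 1 - (1 - 1/N)^N;

% Without replacement: .632*N examples are drawn (rounding is assumed here).
sampsize_without_replacement = round(0.632 * N);
frac_inbag_without_replacement = sampsize_without_replacement / N;

% Expected out-of-bag counts under each scheme.
oob_with    = N * (1 - frac_inbag_with_replacement);
oob_without = N - sampsize_without_replacement;
```

Both schemes leave roughly 92 of the 250 examples out of bag per tree, which is why zeros remain in inbag with replace=0.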

GoogleCodeExporter commented 8 years ago
Ok. So with replacement, we sample N from N. If replace=0, we sample 
0.632*N from N without replacement.
I have another question. When we select mtry features from all the features, 
could we assign a weight vector and select the features according to their 
weights? If so, where should I change the code?
I can see there is a 'mexRF_train' function in your code. However, I could not 
find the code for this function in the package.
Thanks.

Original comment by zhangleu...@gmail.com on 2 Oct 2012 at 3:26

GoogleCodeExporter commented 8 years ago
you can always change how many examples are being sampled by tweaking sampsize

mexRF_train is compiled from a bunch of files in the src folder. you can find 
the list of files being compiled in compile_windows.m. you will have to modify 
the c/c++ and maybe fortran code to implement that.

Original comment by abhirana on 2 Oct 2012 at 3:47