sjwohlman / randomforest-matlab

Automatically exported from code.google.com/p/randomforest-matlab

what is Y_train in classRF_train? #28

GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
in the function classRF_train(X, Y, ntree, mtry, extra_options), what are X and Y?
as per the readme file, X is the data matrix and Y holds the target values. could
you please explain their individual roles more clearly?
as far as i understand, xtrain and xtest take the features as input, but what
about ytrain and ytest? what should the input be there? is it some kind of
index? please correct me if i am wrong.
also, please tell me when to use RF_Class_C and when to use RF_Reg_C, with some
examples.
thank you.

Original issue reported on code.google.com by abhi4emb...@gmail.com on 7 Mar 2012 at 3:54

GoogleCodeExporter commented 8 years ago
hi 

the X and Y included in the package (for, say, the diabetes dataset) represent
the data.

X and Y for the diabetes dataset are described here:
http://www-stat.stanford.edu/~tibs/ftp/lars.pdf (pg 2, table-1)

the goal is to predict the response Y based on the inputs X. 

RF is mostly used in a supervised learning setting where multiple features (in 
X) are used to predict a single response or target (in Y)

so in your setting, pass xtrain and ytrain together to the _train() function,
then measure the performance of the RF algorithm by passing only xtest to
_predict() and comparing the predictions it returns with ytest.
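
the train/predict/compare workflow described above can be sketched roughly as follows. this is a plain Python illustration, not the package's MATLAB API: the `train` and `predict` helpers are hypothetical, a 1-nearest-neighbour rule stands in for the random forest so the example stays self-contained, and the data is made up.

```python
import random

def train(x_rows, y_labels):
    # Hypothetical stand-in for a _train() call: 1-nearest-neighbour
    # "training" simply stores the labelled examples.
    return list(zip(x_rows, y_labels))

def predict(model, x_rows):
    # Hypothetical stand-in for a _predict() call: each query row gets the
    # label of the closest stored training row.
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(model, key=lambda pair: sq_dist(pair[0], row))[1]
            for row in x_rows]

random.seed(0)
# Made-up data: two features per row; the label is 1 if their sum is
# positive and -1 otherwise.
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
Y = [1 if x1 + x2 > 0 else -1 for x1, x2 in X]

# Split X and Y at the same index so every row keeps its own label.
xtrain, ytrain = X[:150], Y[:150]
xtest, ytest = X[150:], Y[150:]

model = train(xtrain, ytrain)   # training sees features AND labels
ypred = predict(model, xtest)   # prediction sees features only

# ytest is used only at the end, to score the predictions.
accuracy = sum(p == t for p, t in zip(ypred, ytest)) / len(ytest)
print(f"test accuracy: {accuracy:.2f}")
```

the point to notice is that ytest never reaches the model; it is only the answer key used to score the predictions.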

you can run either classification (using RF_Class_C), when Y holds class
labels, or regression (using RF_Reg_C), when Y holds continuous values.

take a look at page 11 of
http://www.cs.colorado.edu/~grudic/teaching/CSCI5622_2006/Introduction.pdf

it would also help to look at an introductory statistics book.

i am not responding to your question in issue-25, as it is all answered here.

Original comment by abhirana on 7 Mar 2012 at 4:15

GoogleCodeExporter commented 8 years ago
when i run RF_tutorial.m, it loads data/twonorm. actually it loads twonorm.mat,
producing two matrices named output and input. where do these values come
from? what does the value in the output variable signify? is it chosen at random?

Original comment by abhi4emb...@gmail.com on 7 Mar 2012 at 8:50

GoogleCodeExporter commented 8 years ago
these are the details of twonorm.mat:

http://www.cs.toronto.edu/~delve/data/twonorm/desc.html

the data in twonorm.mat is a subsample of about 300 examples from the twonorm
distribution.

Original comment by abhirana on 7 Mar 2012 at 8:54

GoogleCodeExporter commented 8 years ago
output (class labels/target values, a 1-dimensional vector) = Y
input (matrix of multiple features, one row per example) = X
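
a tiny sketch of how input and output line up, in plain Python rather than MATLAB. the numbers below are made up for illustration and are not taken from twonorm.mat.

```python
# Made-up toy data: 4 examples (rows), 3 features each (columns).
X = [
    [0.2, 1.5, -0.7],
    [1.1, 0.3, 0.9],
    [-0.4, -1.2, 0.5],
    [0.8, 0.1, -1.3],
]
# One label per row of X; here the two classes are coded as 1 and -1.
Y = [1, -1, -1, 1]

n_examples = len(X)
n_features = len(X[0])

# Every row must have the same number of features, and Y must have
# exactly one entry per row of X.
assert all(len(row) == n_features for row in X)
assert len(Y) == n_examples

for row, label in zip(X, Y):
    print(row, "->", label)
```

Y is not an index into X; the i-th entry of Y is simply the known answer for the i-th row of X.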

Original comment by abhirana on 7 Mar 2012 at 8:55

GoogleCodeExporter commented 8 years ago
i have seen the tar file but could not figure out exactly what is in
output.
it is some combination of 1's and -1's, but in what pattern? why are they
written only like that? is there a reason behind it, or is it just a way to
represent a 1-D vector? and why a combination of 1 and -1?
would it give me a wrong result if i put all values as '1' in the output matrix?

Original comment by abhi4emb...@gmail.com on 7 Mar 2012 at 9:38

GoogleCodeExporter commented 8 years ago
are you familiar with classification and regression problems, where the goal is
to learn a function from data? i think you need to brush up on that. i gave
you the link so that you can see what distribution generates twonorm.

in the simplest terms, i can generate a synthetic dataset as follows:
Yhat = (X1 + X2)^2, where X1 and X2 are two features and Yhat is the output
value, with the goal that the learner can predict the output for future
examples from this distribution

in classification, i can make a rule saying if Yhat > 2 it's class-1, else it's
class-2. it's no fun learning if all labels are the same. the pattern is not in
Yhat or Y but in X, and that is what the classifier is expected to learn.

in regression i try to learn the rule for predicting Yhat values directly 
rather than via labels.
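
a rough sketch of the synthetic example above, in plain Python rather than MATLAB. the sampling range, sample size and variable names are arbitrary assumptions made only for illustration.

```python
import random

random.seed(1)
# Two features per example, drawn uniformly from [-1.5, 1.5]
# (an arbitrary range chosen for this sketch).
X = [[random.uniform(-1.5, 1.5), random.uniform(-1.5, 1.5)]
     for _ in range(1000)]

# Regression target: the continuous value Yhat = (X1 + X2)^2 itself.
y_reg = [(x1 + x2) ** 2 for x1, x2 in X]

# Classification target: class 1 if Yhat > 2, otherwise class 2.
y_cls = [1 if v > 2 else 2 for v in y_reg]

# Count how many examples fall into each class.
counts = {c: y_cls.count(c) for c in (1, 2)}
print("class counts:", counts)

# If every label were the same, there would be nothing to separate,
# whatever the features look like.
y_const = [1] * len(X)
print("distinct labels:", len(set(y_cls)), "vs", len(set(y_const)))
```

the same X serves both tasks; only the target changes, from the continuous y_reg (regression) to the thresholded y_cls (classification).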

another example: can you predict the chance of some disease (yes/no, i.e.
classes) or the amount of cholesterol (continuous values) given features such
as height, weight, age, etc.? the goal is to learn patterns from features
like height and predict disease/cholesterol for future patients.

Original comment by abhirana on 7 Mar 2012 at 9:50

GoogleCodeExporter commented 8 years ago

Original comment by abhirana on 31 Mar 2012 at 8:39