tingliu / randomforest-matlab

Automatically exported from code.google.com/p/randomforest-matlab

Categorical Predictor Variables #41

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Great work on the random forest implementation.  Coming from the R version, 
this was an easy adjustment.  I have been using it effectively on matrices of 
continuous data, but how does it handle categorical data?  I can't pass in an 
array of strings, nor can I assign integer values to categories because I don't 
want them to be treated as continuous.  Any suggestions?

Original issue reported on code.google.com by jmccra...@gmail.com on 23 Aug 2012 at 7:06

GoogleCodeExporter commented 8 years ago
Hello,

i think i put in categories somewhere (via ncat and cat variables) but at that 
point i didn't have the time or a dataset to test out both the r version (i am 
somewhat r-challenged) and the matlab version. if you have a simple mixed 
categorical/numerical dataset, can you send it to me? i can try it in 
matlab and r; if not that's ok, let me see if i can get one from somewhere. 

the issue with strings is that they are converted to categorical integers 
within the r code anyway, and i don't know if matlab even supports mixing 
integers and strings in a single matrix (for me to process within the training 
code).
so if you can do some kind of preprocessing (like converting into categorical 
integers) before sending it to the rf training code, then maybe that works out?

Original comment by abhirana on 24 Aug 2012 at 6:00
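
The preprocessing suggested above can be sketched in base MATLAB. This is only a minimal illustration, not part of the package: the variable names and data are hypothetical, and it assumes each categorical column starts out as a cell array of strings.

```matlab
% Hypothetical mixed data: one string column and one numeric column.
colour = {'red'; 'blue'; 'green'; 'blue'; 'red'};   % categorical, as strings
weight = [1.2; 3.4; 2.2; 0.7; 5.1];                 % continuous

% unique returns the sorted distinct strings and, as its third output,
% the position of each row's string in that list: one integer per category.
[colour_names, ~, colour_codes] = unique(colour);
colour_codes = colour_codes(:);                     % force a column vector

% All-numeric matrix that can be handed to the training code.
X = [colour_codes, weight];

% colour_names(colour_codes) recovers the original strings row by row.
assert(isequal(colour_names(colour_codes), colour));
```

The integer codes carry no meaning beyond identity, so the column still has to be flagged as categorical when training; otherwise it is split on as if it were continuous.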

GoogleCodeExporter commented 8 years ago
Thanks for the quick response!

Yeah, combining strings and integers in one matrix is not possible.  After a 
bit more reading, it seems the best bet would probably be to construct a 
dataset array (http://www.mathworks.com/help/toolbox/stats/bqziht7-1.html) 
which is similar to R's data frames.  They can hold cells, categorical, 
ordinal, and numeric columns, and columns can be accessed by name or by index.  
This seems like it would be the most flexible, but they are relatively new 
(R2007a) and require the Statistics Toolbox so I have never seen them used.

Anyway, back to random forests: yes, it would be simple to convert to integer 
codes for categorical variables.  I just need to be sure that they are being 
treated as categorical instead of continuous so that the order of the coding 
doesn't bias the splits.  I didn't see anything in the tutorial about 
categorical variables, but if there's already a way to do it that's great, 
could you explain?

If not, I don't have my dataset yet, but the hospital dataset in the statistics 
toolbox has mixed data I think, as does census income from the UCI repository.  
I haven't looked at these too much, but I hope they help.  I looked at 
TreeBagger again, and it has an option to enter a logical array to identify 
categorical variables, but I would prefer to use your package as I have read it 
is considerably faster.

Thanks for the help!

Original comment by jmccra...@gmail.com on 24 Aug 2012 at 6:14
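
The ordering concern raised above can be made concrete by counting the partitions a single split can reach. The sketch below is generic arithmetic, not taken from the package, and K is a hypothetical number of categories. A threshold on integer codes can only cut between adjacent codes, while a split that treats the column as unordered may send any subset of categories to one side.

```matlab
K = 4;                        % hypothetical number of categories, coded 1..K

% Treated as continuous: a split is a threshold between adjacent codes,
% so only K-1 partitions are reachable, each a contiguous range of codes.
n_ordered = K - 1;

% Treated as unordered: any non-empty proper subset can go to one side;
% a subset and its complement describe the same partition.
n_unordered = 2^(K-1) - 1;

fprintf('threshold splits: %d, subset splits: %d\n', n_ordered, n_unordered);
```

With codes 1..4, for example, no threshold can separate categories {2, 4} from {1, 3}, so the arbitrary numbering decides which groupings a tree can find in one step.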

GoogleCodeExporter commented 8 years ago
Hey

i just added new code into the svn (both classification/regression) and i think 
categorical data is now considered within the code.

how do i know it's being considered? shorter and more accurate trees are being 
created. 

just make sure that the categorical data values get a unique number (a unique 
integer should suffice) for each category they belong to. 

the example code is at the end of the tutorial files (i converted existing 
datasets into categorical data). it basically tells the code which features are 
categorical via an option, extra_options.categorical_feature = 1xD vector with 
mapping of what features to consider as categorical

do tell if you run into any issues.

yeh, i guess i will skip the mixed matrix till it is available in base matlab.

Original comment by abhirana on 26 Aug 2012 at 12:02
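
A minimal sketch of how the option described above might be used. Several things here are assumptions rather than confirmed behaviour: that the 1xD vector holds a 1 for each categorical column and a 0 for each continuous one, and that the trainer is called as classRF_train(X, Y, ntree, mtry, extra_options). The data and parameter values are hypothetical; the example at the end of tutorial_ClassRF.m is the authoritative reference.

```matlab
% Hypothetical toy data: column 1 holds integer category codes (1..3),
% column 2 is continuous. Y holds the class labels.
X = [1 0.5; 2 1.7; 3 0.2; 1 2.9; 2 1.1; 3 0.4];
Y = [1; 2; 1; 2; 2; 1];

% One entry per feature (1 x D). Assumption: 1 marks a categorical
% column and 0 a continuous one; check the tutorial for the exact format.
extra_options.categorical_feature = [1 0];

ntree = 100;    % number of trees (illustrative value)
mtry  = 1;      % features tried at each split (illustrative value)

% Only call the trainer if the package is on the MATLAB path.
% The call signatures below are assumed, not verified against the svn code.
if exist('classRF_train', 'file')
    model = classRF_train(X, Y, ntree, mtry, extra_options);
    Y_hat = classRF_predict(X, model);
end
```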

GoogleCodeExporter commented 8 years ago
Awesome, thanks!  It's great to see such a quick update.

Original comment by jmccra...@gmail.com on 27 Aug 2012 at 3:43

GoogleCodeExporter commented 8 years ago
So I finally got my dataset and want to run the random forest, but I'm not 
seeing the example in the tutorial.  Did you upload the changes?

Original comment by jmccra...@gmail.com on 20 Sep 2012 at 7:27

GoogleCodeExporter commented 8 years ago
oh, it's at the end of the tutorial file

http://code.google.com/p/randomforest-matlab/source/browse/trunk/RF_Class_C/tutorial_ClassRF.m#256

Original comment by abhirana on 20 Sep 2012 at 7:54

GoogleCodeExporter commented 8 years ago
Ah, I see.  I had just redownloaded the precompiled .zip from the download 
link.  What do I need to update from the files in the source tab?

Original comment by jmccra...@gmail.com on 20 Sep 2012 at 9:15

GoogleCodeExporter commented 8 years ago
actually the svn version is somewhat ahead of the precompiled version in the 
download link.

attached file is an extract

Original comment by abhirana on 20 Sep 2012 at 9:24

GoogleCodeExporter commented 8 years ago
Awesome, thank you.  I really appreciate all the work you've put in here.  I'll 
check back in once I've tested it out.

Original comment by jmccra...@gmail.com on 20 Sep 2012 at 9:44

GoogleCodeExporter commented 8 years ago
Looks good, the forests seem to be working as expected. Thanks for all your 
work!

Original comment by jmccra...@gmail.com on 1 Oct 2012 at 9:03

GoogleCodeExporter commented 8 years ago
Hello,

Could you please provide me with the pre-compiled version of the code shared 
above?
I tried a lot but failed to generate the mex file, as my compilation is giving 
various errors in the classRF.cpp code.

Original comment by Shalini1...@iiitd.ac.in on 24 Jun 2013 at 6:18

GoogleCodeExporter commented 8 years ago
attached is the latest pre-compiled version of the code

Original comment by abhirana on 25 Jun 2013 at 3:13

GoogleCodeExporter commented 8 years ago
Ah, thanks a lot!

Original comment by Shalini1...@iiitd.ac.in on 25 Jun 2013 at 4:54