Closed mikebenfield closed 5 years ago
Are you sure you want to do classification? Your test data doesn't look like it. Could you share the data somehow? Or simulate something similar?
Yes, in the training data, the last column (here with a value of 0) is supposed to be the class. The data is just the forest covertype data here: https://archive.ics.uci.edu/ml/datasets/covertype
(although I changed the classes to be 0-6 rather than 1-7).
Oh, I see you say the test data doesn't look like it. How should the test data be different?
I was just wondering about the 0.000000 in the test data, but that shouldn't be the problem.
Could you please provide a reproducible example? If possible, please also include the scikit-learn run.
Sure. See this gist, which is a Python script to generate an artificial dataset, train a sklearn forest on it, make predictions, and evaluate accuracy.
Sample run:
$ mkdir ~/Data2
$ python compare.py generate --train_size 100000 --test_size 20000 --n_features 100 --n_classes 10 --directory ~/Data2
$ time ../ranger/cpp_version/build/ranger --verbose --treetype 1 --ntree 200 --targetpartitionsize 2 --randomsplits 10 --nthreads 3 --file ~/Data2/data.train --depvarname target --outprefix ~/Data2/rangerout --skipoob --splitrule 5 --write --memmode 1
Starting Ranger.
Loading input file: /Users/mike/Data2/data.train.
Growing trees ..
<snip>
Finished Ranger.
real 1m39.835s
user 4m18.575s
sys 0m5.413s
$ time ../ranger/cpp_version/build/ranger --verbose --predict ~/Data2/rangerout.forest --file ~/Data2/data.test --outprefix ~/Data2/rangerout2
Starting Ranger.
Loading input file: /Users/mike/Data2/data.test.
Loading forest from file /Users/mike/Data2/rangerout.forest.
<snip>
Finished Ranger.
real 0m5.905s
user 0m9.881s
sys 0m0.535s
$ python compare.py prob_train_predict --directory ~/Data2 --tree_count 200 --min_samples_split 2 --split_tries 10 --thread_count 3 --prediction_file ~/Data2/sklearn.prediction
8.99282145104371 seconds to parse training data
39.06211919989437 seconds to train
1.902241539908573 seconds to parse testing data
3.7025850540958345 seconds to predict
$ python compare.py evaluate --prediction ~/Data2/rangerout2.prediction --target ~/Data2/data.target
accuracy: 0.69105
$ python compare.py evaluate --prediction ~/Data2/sklearn.prediction --target ~/Data2/data.target
accuracy: 0.7336
As you can see, ranger trains in about 99 seconds; sklearn in about 48 seconds (including parsing time, which is slow). The difference in accuracy is not nearly as big as with the covertype dataset, but it's still there.
Again, just want to make sure I'm using ranger correctly.
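For reference, the totals quoted above can be re-derived from the sample run. This is only a small arithmetic sketch: the numbers are copied from the output shown in this thread, and the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(real: str) -> float:
    """Convert a `time`-style duration such as '1m39.835s' to seconds."""
    minutes, seconds = real.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Figures copied from the sample run above.
ranger_train = to_seconds("1m39.835s")                   # ranger training, wall clock
sklearn_train = 8.99282145104371 + 39.06211919989437     # sklearn: parse + train
ranger_acc, sklearn_acc = 0.69105, 0.7336

print(f"ranger training:  {ranger_train:.1f} s")
print(f"sklearn training: {sklearn_train:.1f} s")
print(f"runtime ratio:    {ranger_train / sklearn_train:.2f}x")
print(f"accuracy gap:     {(sklearn_acc - ranger_acc) * 100:.2f} percentage points")
```

The ranger figure also includes the time ranger spends loading the training file, so the comparison with "parse + train" for sklearn is roughly like for like.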
Thanks for providing the example. We have two issues here, accuracy and runtimes.
On accuracy, the relevant parameters are mtry and randomsplits.
The number of variables considered for splitting is mtry (max_features in scikit-learn).
The number of random splits tried for each of the mtry variables is randomsplits (always 1 in scikit-learn).
targetpartitionsize sets the node size at which splitting stops (like min_samples_leaf in scikit-learn).
With these, the equivalent ranger parameters would be (file parameters omitted):
--verbose --treetype 1 --ntree 200 --targetpartitionsize 1 --randomsplits 1 --nthreads 3 --depvarname target --skipoob --splitrule 5 --write --noreplace --mtry 10 --fraction 1
This should result in comparable accuracy to scikit-learn.
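The mapping described above can be written down as a small helper that turns scikit-learn-style settings into the ranger flags quoted in this thread. This is a sketch, not part of ranger or scikit-learn: the function name `ranger_extratrees_args` is hypothetical, and pairing `n_estimators` with `--ntree` and `n_jobs` with `--nthreads` is an assumption based on the sample run (200 trees and 3 threads on both sides). File-related flags are left out, as in the command above.

```python
def ranger_extratrees_args(n_estimators, max_features, min_samples_leaf=1, n_jobs=1):
    """Build ranger CLI flags from scikit-learn-style extra-trees settings.

    Hypothetical helper for illustration; it only uses flags that appear
    in the commands posted in this thread.
    """
    return [
        "--treetype", "1",                               # as in the thread's commands
        "--splitrule", "5",                              # as in the thread's commands
        "--ntree", str(n_estimators),                    # assumed: n_estimators
        "--mtry", str(max_features),                     # max_features
        "--randomsplits", "1",                           # scikit-learn always tries 1
        "--targetpartitionsize", str(min_samples_leaf),  # min_samples_leaf
        "--nthreads", str(n_jobs),                       # assumed: n_jobs
        "--noreplace", "--fraction", "1",                # taken from the command above
    ]


# Settings from the sample run: 200 trees, 10 candidate variables, 3 threads.
args = ranger_extratrees_args(n_estimators=200, max_features=10, n_jobs=3)
print(" ".join(args))
```

The point of the helper is that the value 10 from the sklearn run belongs in `--mtry`, not in `--randomsplits`, which stays at 1.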
As for runtime, ranger excels on datasets with many predictors and/or small numbers of unique values in the predictors. Your dataset is the exact opposite of that. ;) For details, see our paper (e.g. Table 2 and Fig. 4).
Great; thanks for the help!
Please reopen if needed
I'm using Ranger from the command line. I'm getting runtime performance and accuracy scores that are much worse than, for instance, scikit-learn, so I'm wondering if I'm using it wrong.
I'm trying to use the extremely randomized trees for classification.
Here's how I'm training Ranger:
And how I'm making predictions:
Training takes around 7 minutes (as opposed to 2-3 minutes for scikit-learn) and accuracy is something around 80% (as opposed to 96% for scikit-learn).
The first couple rows of my training data look like this:
And the first couple rows of the data I'm predicting for look like this:
Is this the right format for the data and the correct command line invocation? Thanks for any help.