correct command line usage

mikebenfield commented 6 years ago

I'm using Ranger from the command line. I'm getting runtime performance and accuracy scores that are much worse than, for instance, scikit-learn, so I'm wondering if I'm using it wrong.

I'm trying to use the extremely randomized trees for classification.

Here's how I'm training Ranger:

./ranger --verbose --treetype 1 --ntree 200 --targetpartitionsize 2 --randomsplits 8 --nthreads 3  --file ~/Data/CovType/data_ranger.train --depvarname output --outprefix ~/Data/CovType/rangerout --skipoob --splitrule 5 --write --mtry 8

And how I'm making predictions:

./ranger --verbose --predict ~/Data/CovType/rangerout.forest --file ~/Data/CovType/data_ranger.test --outprefix ~/Data/CovType/rangerout2

Training takes around 7 minutes (as opposed to 2-3 minutes for scikit-learn) and accuracy is something around 80% (as opposed to 96% for scikit-learn).

The first couple rows of my training data look like this:

a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 a21 a22 a23 a24 a25 a26 a27 a28 a29 a30 a31 a32 a33 a34 a35 a36 a37 a38 a39 a40 a41 a42 a43 a44 a45 a46 a47 a48 a49 a50 a51 a52 a53 output
3080.000000 48.000000 19.000000 85.000000 23.000000 2985.000000 224.000000 196.000000 99.000000 5778.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0

And the first couple rows of the data I'm predicting for look like this:

a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 a21 a22 a23 a24 a25 a26 a27 a28 a29 a30 a31 a32 a33 a34 a35 a36 a37 a38 a39 a40 a41 a42 a43 a44 a45 a46 a47 a48 a49 a50 a51 a52 output
2582.000000 144.000000 10.000000 420.000000 54.000000 711.000000 235.000000 238.000000 133.000000 1316.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

Is this the right format for the data and the correct command line invocation? Thanks for any help.
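One quick sanity check on the format question is to compare the two header lines directly. In the samples above, the training header lists a0 through a53 while the test header stops at a52. The sketch below assumes whitespace-separated files whose first line holds the column names, as in those samples. The function names are hypothetical helpers, not part of ranger, and the thread does not say whether ranger requires both files to carry identical columns.

```python
def compare_headers(train_header, test_header, target="output"):
    """Return (only_in_train, only_in_test) feature names, ignoring the target column."""
    train_cols = [c for c in train_header.split() if c != target]
    test_cols = [c for c in test_header.split() if c != target]
    train_set, test_set = set(train_cols), set(test_cols)
    only_in_train = [c for c in train_cols if c not in test_set]
    only_in_test = [c for c in test_cols if c not in train_set]
    return only_in_train, only_in_test


def row_matches_header(header, row):
    """True if a data row has exactly one value per header column."""
    return len(row.split()) == len(header.split())


# Toy headers (hypothetical, much shorter than the real files).
train = "a0 a1 a2 output"
test = "a0 a1 output"
print(compare_headers(train, test))          # (['a2'], [])
print(row_matches_header(test, "1.0 2.0"))   # False: 2 values, 3 columns
```

Running the same two functions on the first two lines of each real file would show whether the feature columns line up and whether each row has a value for every named column.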

mnwright commented 6 years ago

Are you sure you want to do classification? Your test data doesn't look like it. Could you share the data somehow? Or simulate something similar?

mikebenfield commented 6 years ago

Yes, in the training data, the last column (here with a value of 0) is supposed to be the class. The data is just the forest covertype data here: https://archive.ics.uci.edu/ml/datasets/covertype

(although I changed the classes to be 0-6 rather than 1-7).

mikebenfield commented 6 years ago

Oh, I see you say the test data doesn't look like it. How should the test data be different?

mnwright commented 6 years ago

I was just wondering about the 0.000000 in the test data, but that shouldn't be the problem. Could you please provide a reproducible example? If possible, please also include the scikit-learn run.

mikebenfield commented 6 years ago

Sure. See this gist, which is a Python script to generate an artificial dataset, train a sklearn forest on it, make predictions, and evaluate accuracy.

Sample run:

$ mkdir ~/Data2
$ python compare.py generate --train_size 100000 --test_size 20000 --n_features 100 --n_classes 10 --directory ~/Data2
$ time ../ranger/cpp_version/build/ranger --verbose --treetype 1 --ntree 200 --targetpartitionsize 2 --randomsplits 10 --nthreads 3  --file ~/Data2/data.train --depvarname target --outprefix ~/Data2/rangerout --skipoob --splitrule 5 --write --memmode 1
Starting Ranger.
Loading input file: /Users/mike/Data2/data.train.
Growing trees ..
<snip>
Finished Ranger.

real    1m39.835s
user    4m18.575s
sys     0m5.413s
$ time ../ranger/cpp_version/build/ranger --verbose --predict ~/Data2/rangerout.forest --file ~/Data2/data.test --outprefix ~/Data2/rangerout2
Starting Ranger.
Loading input file: /Users/mike/Data2/data.test.
Loading forest from file /Users/mike/Data2/rangerout.forest.
<snip>
Finished Ranger.

real    0m5.905s
user    0m9.881s
sys     0m0.535s
$ python compare.py prob_train_predict --directory ~/Data2 --tree_count 200 --min_samples_split 2 --split_tries 10 --thread_count 3 --prediction_file ~/Data2/sklearn.prediction
8.99282145104371 seconds to parse training data
39.06211919989437 seconds to train
1.902241539908573 seconds to parse testing data
3.7025850540958345 seconds to predict
$ python compare.py evaluate --prediction ~/Data2/rangerout2.prediction --target ~/Data2/data.target
accuracy: 0.69105
$ python compare.py evaluate --prediction ~/Data2/sklearn.prediction --target ~/Data2/data.target
accuracy: 0.7336

As you can see, ranger trains in about 100 seconds; sklearn in about 48 seconds (including parsing time, which is slow). The difference in accuracy is not nearly as big as with the covertype dataset, but it's still there.
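For reference, the wall-clock comparison can be reproduced from the numbers in the sample run. This is a small sketch using only the values printed above: the `real` time reported for the ranger training run, and the sum of the sklearn parse and train times. The helper name is hypothetical.

```python
import re


def time_to_seconds(text):
    """Convert a shell `time` value such as '1m39.835s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Values copied from the sample run above.
ranger_train = time_to_seconds("1m39.835s")
sklearn_train = 8.99282145104371 + 39.06211919989437  # parse + train

print(f"ranger:  {ranger_train:.1f} s")
print(f"sklearn: {sklearn_train:.1f} s")
print(f"ratio:   {ranger_train / sklearn_train:.2f}")
```

Note that the ranger figure includes loading the input file, so both sides include parsing.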

Again, just want to make sure I'm using ranger correctly.

mnwright commented 6 years ago

Thanks for providing the example. We have two issues here, accuracy and runtimes.

Accuracy

You mixed up mtry and randomsplits. The number of variables considered for splitting is mtry (max_features in scikit-learn). The number of random splits tried for each of the mtry variables is randomsplits (always 1 in scikit-learn).
ranger grows trees on bootstrap samples by default, whereas scikit-learn's ExtraTreesClassifier uses all samples.
In ranger, targetpartitionsize sets the node size at which splitting stops (like min_samples_leaf in scikit-learn).
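To make the mtry versus randomsplits distinction concrete, here is a simplified sketch of how split candidates could be drawn at a single node in an extremely randomized tree. It is an illustration of the general idea only, not ranger's or scikit-learn's actual implementation. The function name and the toy data are assumptions.

```python
import random


def candidate_splits(rows, mtry, randomsplits, rng):
    """Draw extra-trees style split candidates for one node.

    rows: list of feature vectors (lists of floats) that reached the node.
    Returns a list of (feature_index, threshold) pairs.
    """
    n_features = len(rows[0])
    # mtry controls how many distinct variables are considered.
    features = rng.sample(range(n_features), mtry)
    candidates = []
    for j in features:
        values = [row[j] for row in rows]
        lo, hi = min(values), max(values)
        # randomsplits controls how many random thresholds are tried
        # for each of those variables.
        for _ in range(randomsplits):
            candidates.append((j, rng.uniform(lo, hi)))
    return candidates


rng = random.Random(0)
rows = [[rng.random() for _ in range(100)] for _ in range(50)]  # toy data
print(len(candidate_splits(rows, mtry=10, randomsplits=1, rng=rng)))   # 10
print(len(candidate_splits(rows, mtry=10, randomsplits=10, rng=rng)))  # 100
```

With randomsplits set to 1, each of the mtry variables contributes exactly one random threshold, which is the scikit-learn behaviour described above. Raising randomsplits multiplies the number of candidates evaluated per node.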

With these changes, the equivalent ranger parameters would be (file parameters omitted):

--verbose --treetype 1 --ntree 200 --targetpartitionsize 1 --randomsplits 1 --nthreads 3 --depvarname target --skipoob --splitrule 5 --write --noreplace --mtry 10 --fraction 1

This should result in comparable accuracy to scikit-learn.
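The parameter correspondence described in this comment can be written down as a small helper that turns scikit-learn style settings into ranger command-line arguments. This is a sketch under the assumptions stated in the thread: only flags that appear in the commands above are used, the function name is hypothetical, and the file arguments (--file, --outprefix) are left out as in the command above. The flag order differs from the thread's command.

```python
import shlex


def ranger_extratrees_args(n_estimators, max_features, min_samples_leaf,
                           n_jobs, depvarname):
    """Build ranger CLI arguments from scikit-learn style settings."""
    pairs = [
        ("--treetype", 1),                          # value used in the thread
        ("--splitrule", 5),                         # value used in the thread
        ("--ntree", n_estimators),                  # number of trees
        ("--mtry", max_features),                   # variables tried per split
        ("--randomsplits", 1),                      # one random split per variable
        ("--targetpartitionsize", min_samples_leaf),
        ("--fraction", 1),                          # with --noreplace: use all samples
        ("--nthreads", n_jobs),
        ("--depvarname", depvarname),
    ]
    args = ["--verbose", "--skipoob", "--write", "--noreplace"]
    for flag, value in pairs:
        args += [flag, str(value)]
    return args


args = ranger_extratrees_args(n_estimators=200, max_features=10,
                              min_samples_leaf=1, n_jobs=3,
                              depvarname="target")
print(shlex.join(["ranger"] + args))
```

The printed line contains the same flags and values as the suggested command; the input and output file arguments still need to be appended before running it.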

Runtime

ranger excels on datasets with many predictors and/or small numbers of unique values in the predictors. Your dataset is the exact opposite of that. ;) For details, see also our paper (e.g. Table 2 and Fig. 4).

mikebenfield commented 6 years ago

Great; thanks for the help!

mnwright commented 5 years ago

Please reopen if needed

imbs-hl / ranger

correct command line usage #341
