sanity / quickml

A fast and easy to use decision tree learner in java
http://quickml.org/
GNU Lesser General Public License v3.0

RandomForest example is broken #116

Closed by mbatchkarov 9 years ago

mbatchkarov commented 9 years ago

The example on the quickml website, showing how to train a random forest on the iris data, is broken. Consider the following example:

package quickml;

import quickml.data.AttributesMap;
import quickml.data.ClassifierInstance;
import quickml.supervised.classifier.randomForest.RandomForest;
import quickml.supervised.classifier.randomForest.RandomForestBuilder;
import java.io.IOException;
import java.util.List;

public class TestIrisAccuracy {
    public static void main(String[] args) throws IOException {
        List<ClassifierInstance> irisDataset = PredictiveAccuracyTests.loadIrisDataset();
        final RandomForest randomForest = new RandomForestBuilder().buildPredictiveModel(irisDataset);
        AttributesMap attributes = new AttributesMap();
        attributes.put("sepal-length", 5.84);
        attributes.put("sepal-width", 3.05);
        attributes.put("petal-length", 3.76);
        attributes.put("petal-width", 1.20);
        System.out.println("Prediction: " + randomForest.predict(attributes));
    }
}

This outputs:

Prediction: {Iris-virginica=0.3333333333333333, Iris-setosa=0.3333333333333333, Iris-versicolor=0.3333333333333333}

The forest is clearly not learning anything: every class gets the same probability. I've observed the same behaviour with a range of other datasets. I am running the latest git version (commit 7584656f32).

PS I had to remove an empty line at the end of the dataset, otherwise a fourth empty label is created. This is a separate issue and does not affect the bug above. I might open another issue for that.

athawk81 commented 9 years ago

Agreed. Something is fishy with this prediction... though my suspicion is that either the default settings of the random forest are not adequate (e.g. depth of trees may be set to 1) or the data set is borked. I am working on a major refactor of TreeBuilder at present, but will take a closer look at this issue this weekend.

athawk81 commented 9 years ago

Hey, the issue was the data set load: the attribute values were being loaded as strings when they should have been loaded as Numbers (e.g. Doubles). The default settings are also not very good for that particular problem. Since there are only 4 attributes, an ignore-attribute probability of .7 isn't very effective. The latest release of QuickML has the fix for the load in place.
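
For anyone loading their own data, here is a minimal sketch of the kind of fix described above: parse each measurement into a Double before putting it in the attribute map, rather than storing the raw string. This is not the actual QuickML loader. The class IrisRowParser and its methods are hypothetical, a plain java.util.Map stands in for AttributesMap, and the row layout (four measurements followed by the label) is an assumption about the iris CSV.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of QuickML.
public class IrisRowParser {
    // Attribute keys as used in the example from the original report.
    static final String[] COLUMNS = {"sepal-length", "sepal-width", "petal-length", "petal-width"};

    // Parses a row such as "5.1,3.5,1.4,0.2,Iris-setosa" into numeric attributes.
    public static Map<String, Serializable> parseAttributes(String csvRow) {
        String[] fields = csvRow.trim().split(",");
        if (fields.length < COLUMNS.length + 1) {
            throw new IllegalArgumentException("Expected 4 measurements and a label: " + csvRow);
        }
        Map<String, Serializable> attributes = new HashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            // Store a Double, not the raw String, so the value is treated as numeric.
            attributes.put(COLUMNS[i], Double.valueOf(fields[i].trim()));
        }
        return attributes;
    }

    // Returns the class label, assumed to be the last of the five fields.
    public static String parseLabel(String csvRow) {
        String[] fields = csvRow.trim().split(",");
        if (fields.length < COLUMNS.length + 1) {
            throw new IllegalArgumentException("Expected 4 measurements and a label: " + csvRow);
        }
        return fields[COLUMNS.length].trim();
    }

    public static void main(String[] args) {
        String row = "5.1,3.5,1.4,0.2,Iris-setosa";
        Map<String, Serializable> attributes = parseAttributes(row);
        System.out.println(attributes.get("petal-width") instanceof Double);
        System.out.println(parseLabel(row));
    }
}
```

On the second point, a rough illustration: if each of the 4 attributes were ignored independently with probability 0.7 (an assumption about how the setting is applied), a split would consider only 4 × 0.3 = 1.2 attributes on average, and none at all about 24% of the time (0.7^4 ≈ 0.24).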