Added accuracy test for iris data

sanity / quickml

A fast and easy to use decision tree learner in java

http://quickml.org/

GNU Lesser General Public License v3.0


Added accuracy test for iris data #120

Closed: mbatchkarov closed this 8 years ago

mbatchkarov commented 9 years ago

This is a test to demonstrate #116 has not been resolved in 9ec1c3ff24e0d9291b27c1ac563e0.

I've deliberately increased the maximum depth and the number of trees to make sure the classifier will overfit. Despite this, it predicts versicolor for all instances in the training set.

sanity commented 9 years ago

@athawk81 said he would investigate this. He also mentioned that with a 0.7 skip-attribute probability and only 4 attributes, it wouldn't be surprising if this led to some stumpy trees, so this behavior might be expected.

athawk81 commented 9 years ago

The ignoreAttributeProbability is set to 0.7 by default. Given that there are only 4 attributes, the odds that all attributes will be ignored (and tree building will cease along that branch) are good. My guess is that you are getting shallow trees and, as a consequence, are just predicting the most common class for all training instances. How about setting it to 0 or 0.2 and seeing what happens? I'll do this myself tomorrow if I don't hear back from you.
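To put a rough number on that guess, here is a minimal sketch. It assumes each attribute is ignored independently with the same probability at a node, which is an assumption about how ignoreAttributeProbability is applied and not something taken from the quickml source. The class and method names are made up for illustration.

```java
import java.util.Locale;

// Sketch only: assumes every attribute is skipped independently with the
// same probability at a node. This is not quickml code.
final class IgnoreProbabilitySketch {

    // Probability that all attributes are skipped at a single node.
    static double probabilityAllIgnored(double ignoreProbability, int attributeCount) {
        return Math.pow(ignoreProbability, attributeCount);
    }

    public static void main(String[] args) {
        int irisAttributes = 4; // the iris data has 4 attributes
        for (double p : new double[] {0.7, 0.2, 0.0}) {
            System.out.printf(Locale.ROOT,
                    "ignore probability %.1f: P(all %d ignored) = %.4f%n",
                    p, irisAttributes, probabilityAllIgnored(p, irisAttributes));
        }
    }
}
```

Under that assumption, a node loses all 4 attributes about 24% of the time at 0.7, about 0.16% of the time at 0.2, and never at 0.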

mbatchkarov commented 9 years ago

I've set that probability to 0 (see line 22 in the commit). Is there another parameter or field that needs to be set? If there is, the docs need an update.

athawk81 commented 9 years ago

Hey mbatchkarov, apologies for the delay. Please see the test class quickml.supervised.classifier.randomForest.TestIrisAccuracy on the most recent version of master. I created a random forest that gave different classifications for the different instances.

mbatchkarov commented 9 years ago

Hey, sorry it has taken me so long to get back to you. I still think the issue has not been resolved. In the first instance, please merge this pull request. In particular, I am interested in the last bit I've added, which calculates accuracy on the training set (somewhat naively). Do make any changes that you feel are appropriate, e.g.

calculate accuracy in a more concise way, if there is one already in quickml
tweak the settings of the random forest builder.

As it stands, TestIrisAccuracy is just a smoke test and doesn't have anything to do with accuracy. I'd like TestIrisAccuracy to be a self-contained example of how one gets good performance on the iris data, as well as a test that quickml can do that.
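For reference, the naive accuracy calculation I have in mind looks roughly like the sketch below. It deliberately uses plain Java lists of label strings and no quickml classes, so the class name, method name, and toy labels are all hypothetical. In the real test, the predicted labels would come from the trained forest.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: a library-independent accuracy helper, not part of quickml.
final class TrainingAccuracySketch {

    // Fraction of positions where the predicted label equals the actual label.
    static double accuracy(List<String> actual, List<String> predicted) {
        if (actual.isEmpty() || actual.size() != predicted.size()) {
            throw new IllegalArgumentException("label lists must be non-empty and of equal size");
        }
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (actual.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / actual.size();
    }

    public static void main(String[] args) {
        // Toy labels, balanced across the three iris classes.
        List<String> actual = Arrays.asList(
                "setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica");
        // A classifier that always answers versicolor, as described in this thread.
        List<String> predicted = Arrays.asList(
                "versicolor", "versicolor", "versicolor", "versicolor", "versicolor", "versicolor");
        System.out.println("training accuracy: " + accuracy(actual, predicted));
    }
}
```

On balanced classes, a classifier that always predicts versicolor scores one third, so a test asserting a training accuracy well above that would catch the behaviour reported here.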