igrigorik / decisiontree

ID3-based implementation of the ML Decision Tree algorithm

Overfitting or am I getting the ruleset interpretation wrong? #3

Closed pagojo closed 8 years ago

pagojo commented 12 years ago

By calling the ruleset method after train on a DecisionTree::ID3Tree object (set up for continuous data), I expect to get back a number of rules. I interpret each rule as a series of ANDed clauses.

However, I often get rules in which the same attribute appears in more than one clause, even though an earlier clause already constrains it.

e.g.,

attrib_1 < 0.02123562506819547
attrib_2 >= 0.1922781177611915
attrib_3 < 0.2879504779121489
attrib_4 < 0.26382498790056597
attrib_4 < 0.193308315974597
=> class1()

In the above case the second mention of attrib_4 is superfluous if the rule is interpreted as:

if attrib_1 < 0.02123562506819547
and attrib_2 >= 0.1922781177611915
and attrib_3 < 0.2879504779121489
and attrib_4 < 0.26382498790056597
and attrib_4 < 0.193308315974597
then class1()
end

So, am I wrong to assume a chain of ANDed clauses? If not, then is the second occurrence of attrib_4 a sign of overfitting which I can safely ignore? Could this just be a bug?
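
If the rule is read as a chain of ANDed clauses, two upper bounds on the same attribute collapse to the smaller one, so here attrib_4 < 0.26382498790056597 is implied by attrib_4 < 0.193308315974597. In general, a tree built on continuous data can split on the same attribute more than once along one path, so a repeat like this does not by itself point to a bug. The following is a minimal sketch of that simplification in plain Ruby. simplify_clauses and CLAUSE are hypothetical helpers, not part of the decisiontree gem, and the clause format is assumed to match the output shown above (only < and >= operators).

```ruby
# Hypothetical helper, not part of the decisiontree gem: collapses a chain of
# ANDed threshold clauses so each attribute keeps only its tightest bounds.
# Assumes clauses look like "attrib_4 < 0.19" or "attrib_2 >= 0.19".
CLAUSE = /\A\s*(\w+)\s*(<|>=)\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*\z/

def simplify_clauses(clauses)
  bounds = {} # attribute name => { "<" => upper bound, ">=" => lower bound }

  clauses.each do |clause|
    match = CLAUSE.match(clause)
    raise ArgumentError, "unrecognised clause: #{clause.inspect}" unless match

    name  = match[1]
    op    = match[2]
    value = Float(match[3])

    slot = (bounds[name] ||= {})
    current = slot[op]
    slot[op] =
      if current.nil?
        value
      elsif op == "<"
        [current, value].min # the smaller upper bound implies the larger one
      else
        [current, value].max # the larger lower bound implies the smaller one
      end
  end

  # Hashes keep insertion order, so attributes come out in first-seen order.
  bounds.flat_map do |name, slot|
    slot.map { |op, value| "#{name} #{op} #{value}" }
  end
end

rule = [
  "attrib_1 < 0.02123562506819547",
  "attrib_2 >= 0.1922781177611915",
  "attrib_3 < 0.2879504779121489",
  "attrib_4 < 0.26382498790056597",
  "attrib_4 < 0.193308315974597"
]

puts simplify_clauses(rule)
```

Run on the rule above, this keeps one clause per attribute and drops the looser attrib_4 bound. It only rewrites the printed rule; the tree itself is unchanged.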

igrigorik commented 12 years ago

Have you tried graphing the actual object? https://github.com/igrigorik/decisiontree/blob/master/lib/decisiontree/id3_tree.rb#L124

Might help understand the structure of your tree.

pagojo commented 12 years ago

Cheers, I did that; I had to install GraphViz and GraphR. One thing I noticed, though, is that version 0.3.2 can't be found on Rubygems.org (for use by gem install or bundler).

My original post was influenced by how AI4R outputs its decision tree rules, which can then be evaled (or copy-pasted into the code).