igrigorik / decisiontree

ID3-based implementation of the ML Decision Tree algorithm
1.44k stars 130 forks source link

Infinite recursion due to the handling of zero gains in id3_continuous #19

Open nicomahler opened 10 years ago

nicomahler commented 10 years ago

On a real production dataset with 5 explanatory variables and ~1000 lines, I received a SystemStackError: stack level too deep when calling DecisionTree::ID3Tree#train.

Trying to figure out what was happening, I built the following simple dataset, which allows to reveal the bug:

attributes = ["X0", "X1", "X2", "X3"]
data = [["a", 0, 1, 1, 1], ["a", 0, 1, 0, 0], ["a", 0, 0, 1, 0], ["a", 0, 0, 0, 1]]
data_type = {X0: :discrete, X1: :continuous, X2: :continuous, X3: :continuous}
tree = DecisionTree::ID3Tree.new(attributes, data, 1, data_type)
tree.train  # SystemStackError is raised here!

The reason of this bug seems to lie in the specific output ([-1, -1]) of DecisionTree::ID3Tree#id3_continuous in the case if values.size == 1 (see this line).

Returning [0, -1] instead of [-1, -1] in the cases if values.size == 1 and if gain.size == 1 in the method #id3_continuous solves the problem.

It would also be relevant to stop the recursion in the case where the selection of each variable leads to a zero gain. That can be done adding in #id3_train the following line:

return data.first.last if performance.all? { |a, b| a <= 0 }

after this line:

performance = attributes.collect { |attribute| fitness_for(attribute).call(data, attributes, attribute) }

What do you think?

Do you want me to make a pull request with these changes?