On a real production dataset with 5 explanatory variables and about 1000 rows, I got a `SystemStackError: stack level too deep` when calling `DecisionTree::ID3Tree#train`.
While trying to figure out what was happening, I built the following simple dataset, which reproduces the bug:
The cause of this bug seems to be the specific return value (`[-1, -1]`) of `DecisionTree::ID3Tree#id3_continuous` when `values.size == 1` (see this line).
Returning `[0, -1]` instead of `[-1, -1]` in the `if values.size == 1` and `if gain.size == 1` cases of `#id3_continuous` solves the problem.
It would also make sense to stop the recursion when selecting any of the variables yields zero gain. That can be done by adding the following line in `#id3_train`:
return data.first.last if performance.all? { |a, b| a <= 0 }
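To make the intent of that guard easier to check in isolation, here is a minimal sketch. It assumes each entry of `performance` is a `[gain, threshold]` pair, one per attribute; `no_positive_gain?` is a hypothetical helper and the values are made up, so none of this is the gem's own code.

```ruby
# Hypothetical helper, not part of the decision_tree gem. It assumes each
# entry of `performance` is a [gain, threshold] pair, one per attribute.
def no_positive_gain?(performance)
  performance.all? { |gain, _threshold| gain <= 0 }
end

# Made-up values for illustration only.
stuck    = [[-1, -1], [0, -1]]      # no attribute improves the split
workable = [[-1, -1], [0.42, 3.5]]  # second attribute has positive gain

puts no_positive_gain?(stuck)     # true
puts no_positive_gain?(workable)  # false
```

When the check is true, no further split can help, so returning a label at that point (as the proposed line does with `data.first.last`) ends the recursion instead of descending again.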
What do you think?
Do you want me to make a pull request with these changes?