Closed PetrToman closed 12 years ago
I tried the steps above, and got the same results. But I believe this may be correct, and the result of over-fitting.
I looked at the "zero error" SVM and ran "Evaluate Method" (using the training EGB) on it, and also got a zero. Then I ran "Validation Chart" on it and saw that each of the rows had a correct result. I normalized the eval data, got a high error rate, and verified that using a validation chart.
I got similar results using the SVM in step 3, so at this point I really can't find any reason not to believe the SVM training error.
Since data7_output.csv contains only ones in the Output:y column, I believe it's not overfitting, but rather total underfitting :-) Therefore the training error should be ~100%. I think I've found the bug: in EncogUtility.calculateClassificationError(), instead of:
return (double)(total-correct) / (double)total; // line 448
there should be:
return (double) correct / (double) total;
With this fix, "Training error" gets to ~95 % and "Validation error" to ~58 %, which looks much more sensible to me.
Ah, that makes sense, I will take a look.
The fix above causes issues, as it inverts the way error percentages are traditionally calculated. The "training error" is the percent incorrect, not the percent correct. So when I make the change you suggest, the error goes to zero percent when the number of correct classifications goes to zero. That is similar to a "test score", but not how machine learning is traditionally evaluated.
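To make the convention concrete, here is a minimal, self-contained sketch (not the actual Encog source) of a classification error computed as the fraction of incorrect predictions, so 0.0 means a perfect classifier. The class and array names are illustrative only:

```java
// Hypothetical sketch of the convention discussed above: classification
// "error" is the fraction of *incorrect* predictions, not correct ones.
public class ClassificationError {

    // Returns the fraction of mismatches between predicted and expected labels.
    static double calculateClassificationError(int[] predicted, int[] expected) {
        int total = predicted.length;
        int correct = 0;
        for (int i = 0; i < total; i++) {
            if (predicted[i] == expected[i]) {
                correct++;
            }
        }
        // Percent incorrect: 0.0 means every prediction was right.
        return (double) (total - correct) / (double) total;
    }

    public static void main(String[] args) {
        int[] predicted = {1, 1, 1, 1};  // a degenerate classifier that always outputs 1
        int[] expected  = {1, 0, 0, 1};  // true labels
        // Two of four predictions are wrong, so the error is 0.5.
        System.out.println(calculateClassificationError(predicted, expected));
    }
}
```

Under this convention, a classifier that always predicts the majority class can still show a low "error" on imbalanced data, which is why the all-ones data7_output.csv looked suspicious.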
Ok, let's have "training error" be the percent incorrect. However, there is certainly not 0.000000 % of incorrect samples. Interestingly enough, if the values for C and gamma from above are passed directly to org.encog.ml.svm.training.SVMSearchTrain, then several iterations are performed as expected. Maybe there is some numerical problem in the training error calculation...
I went through the training process again and found that the 0.000000 % training error really did mean overfitting (i.e. you were right): when I replaced data7_eval.csv with data7_train.csv and ran task-evaluate, I got data7_output.csv with a 100 % match in the Output:y column. I got confused because libsvm (grid.py) searches for the best C and gamma values based on cross-validation, not just by minimizing the training error on the training data. I think it would be useful if SVMSearchTrain did the same thing when a validation set is available.
data7_output.csv (where y = Output:y), after executing task-evaluate.