jeffheaton / encog-java-core

http://www.heatonresearch.com/encog

SVM classification: suspiciously low training error #83

Closed PetrToman closed 12 years ago

PetrToman commented 12 years ago
  1. In Workbench, create an .ega file for SVM classification using the data at http://dione.zcu.cz/~toman40/encog/data7.zip and set "Maximum Error Percent (0-100)" = 10.
  2. Execute "task-full" -> the training ends after only 1 iteration, with "Training Error" = 0.000000 % (wrong).
  3. Run SVM Search Training again with Gamma = 0.005-0.01 step 0.001 and C = 0.01-0.05 step 0.01 -> the training ends with "Training error" = 6.68 %, but after executing task-evaluate, only ~60 % of the rows in data7_output.csv are classified correctly (where y = Output:y).
jeffheaton commented 12 years ago

I tried the steps above, and got the same results. But I believe this may be correct, and the result of over-fitting.

I looked at the "zero error" SVM and ran "Evaluate Method" (using the training EGB) on it, and also got a zero. Then I ran "Validation Chart" on it and saw that each row had a correct result. When I normalized the eval data I got a high error rate, and verified that with a validation chart.

I got similar results with the SVM in step 3... so at this point, I really can't find any reason not to believe the SVM training error.

PetrToman commented 12 years ago

Since data7_output.csv contains only ones in the Output:y column, I believe this isn't overfitting, but rather total underfitting :-) Therefore the training error should be ~100 %. I think I've found the bug: in EncogUtility.calculateClassificationError(), instead of:

return (double)(total-correct) / (double)total;   // line 448

there should be:

return (double) correct / (double) total;

With this fix, "Training error" gets to ~95 % and "Validation error" to ~58 %, which looks much more sensible to me.

jeffheaton commented 12 years ago

Ah, that makes sense, I will take a look.

jeffheaton commented 12 years ago

The fix above causes issues, as it inverts the way error percentages are traditionally calculated. The "training error" is the percent incorrect, not the percent correct. With the change you suggest, the error goes to 0 % when the number of correct classifications goes to zero, which behaves like a "test score," but is not how error is traditionally reported in machine learning.
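The convention can be illustrated with a minimal, self-contained sketch (not Encog's actual implementation; the class name and toy data here are purely illustrative): under the "percent incorrect" definition, an underfit model that predicts 1 for every row scores a high error, which is the behavior one would expect in this issue.

```java
public class ClassificationErrorDemo {

    // Error as traditionally defined: the fraction of *incorrect* predictions.
    // This mirrors the (total - correct) / total form quoted above from
    // EncogUtility.calculateClassificationError().
    static double classificationError(int[] ideal, int[] actual) {
        int total = ideal.length;
        int correct = 0;
        for (int i = 0; i < total; i++) {
            if (ideal[i] == actual[i]) {
                correct++;
            }
        }
        return (double) (total - correct) / (double) total;
    }

    public static void main(String[] args) {
        // Toy labels: mostly zeros, like a class-imbalanced data set.
        int[] ideal  = {0, 0, 0, 1, 0, 0, 0, 0, 1, 0};
        // A degenerate "always predict 1" model (total underfitting).
        int[] actual = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
        System.out.println(classificationError(ideal, actual)); // prints 0.8
    }
}
```

Swapping the formula for correct / total would make this same degenerate model report 0.2, i.e. a low "error" for a mostly wrong classifier, which is why the proposed fix inverts the metric.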

PetrToman commented 12 years ago

OK, let's have "training error" be the percent incorrect. However, there certainly isn't 0.000000 % of incorrect samples. Interestingly enough, if the values for C and gamma from above are fed directly to org.encog.ml.svm.training.SVMSearchTrain, then several iterations are performed as expected. Maybe there is some numerical problem in the training-error calculation...
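One place such a numerical problem could hide (purely a guess, not a confirmed diagnosis of Encog's code) is a grid loop that accumulates a floating-point step: repeated addition of a decimal step drifts, so a `<=` bound can cut off the last grid point. A self-contained illustration with the classic 0.1 step:

```java
public class FloatStepDemo {

    // Naive grid loop that accumulates the step in floating point.
    static int countFloatSteps(double begin, double end, double step) {
        int n = 0;
        for (double v = begin; v <= end; v += step) {
            n++;
        }
        return n;
    }

    // Integer-indexed variant: compute the count once, then derive values
    // as begin + i * step, avoiding accumulated rounding error.
    static int countIntSteps(double begin, double end, double step) {
        return (int) Math.round((end - begin) / step) + 1;
    }

    public static void main(String[] args) {
        // Exact arithmetic would visit 0.1, 0.2, 0.3 -> 3 grid points,
        // but in doubles 0.1 + 0.1 + 0.1 == 0.30000000000000004 > 0.3,
        // so the naive loop stops one point early.
        System.out.println(countFloatSteps(0.1, 0.3, 0.1)); // prints 2
        System.out.println(countIntSteps(0.1, 0.3, 0.1));   // prints 3
    }
}
```

Whether the gamma = 0.005-0.01 step 0.001 range actually hits this in SVMSearchTrain would need checking against the source; this only shows the class of bug being speculated about.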

PetrToman commented 12 years ago

I went through the training process again and found that the 0.000000 % training error really did mean overfitting (i.e. you were right): when I replaced data7_eval.csv with data7_train.csv and ran task-evaluate, I got a data7_output.csv with a 100 % match in the Output:y column. I got confused because libsvm (grid.py) searches for the best C and gamma values based on cross-validation, not just by minimizing the error on the training data. I think it would be useful if SVMSearchTrain did the same when a validation set is available.
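The grid.py-style selection could be sketched like this (a toy, self-contained outline, not Encog's API; validationError() is a hypothetical stand-in for training an SVM and scoring it on held-out or cross-validation folds): scan the same C/gamma grid, but keep the pair that minimizes *validation* error rather than training error.

```java
public class ValidationGridSearch {

    // Hypothetical stand-in: a real implementation would train an SVM with
    // the given parameters and return its error on held-out data. Here it
    // is a toy error surface whose minimum sits at C = 0.03, gamma = 0.008.
    static double validationError(double c, double gamma) {
        return Math.abs(c - 0.03) + Math.abs(gamma - 0.008);
    }

    // Scan the grid from the report (C = 0.01..0.05 step 0.01,
    // gamma = 0.005..0.01 step 0.001), selecting by validation error.
    // Integer loop indices avoid floating-point step drift.
    static double[] search() {
        double bestC = 0, bestGamma = 0, bestErr = Double.MAX_VALUE;
        for (int ci = 0; ci < 5; ci++) {
            double c = 0.01 + ci * 0.01;
            for (int gi = 0; gi < 6; gi++) {
                double gamma = 0.005 + gi * 0.001;
                double err = validationError(c, gamma);
                if (err < bestErr) {
                    bestErr = err;
                    bestC = c;
                    bestGamma = gamma;
                }
            }
        }
        return new double[] {bestC, bestGamma};
    }

    public static void main(String[] args) {
        double[] best = search();
        System.out.println("C=" + best[0] + " gamma=" + best[1]);
    }
}
```

The key difference from minimizing training error is only which data set validationError() scores against; an overfit model like the "zero training error" SVM above would score badly here and be rejected.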