haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

Possible problem in SVM (LASVM) - incorrect results #434

Closed hmf closed 4 years ago

hmf commented 5 years ago

Expected behaviour

I am using the following dataset: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t

Tests with libSVM and a version of SMO I coded myself give an accuracy on the order of 0.9655.

Actual behaviour

However, with LASVM I get an error rate of 0.97725. I ran two tests:

  1. I first train on the training data set svmguide1, which gives me the expected ~97% accuracy. I then use the same model to predict on the test set and get the error rate above.

  2. I repeat the above, but now I predict on the training data set itself. I get approximately the same error rate.

Code snippet

Learning with the training data:

      // Gaussian kernel (sigma = 0.5) and soft-margin parameter C
      val sk   = new GaussianKernel(0.5)
      val svm2 = new SVM(sk, C)

      // relabel -1/+1 targets to 0/1 (as required by SMILE) and convert features to Double
      val ppy = target.map(e => if (e <= 0.0) 0 else e)
      val px  = data.map(_.map(_.toDouble))

      svm2.learn(px, ppy)
      svm2.finish()

      // predict on the training data and compute accuracy / error
      val ps = svm2.predict(px)
      val ok = ps.zip(ppy).count { case (yhat, y) => yhat == y }.toFloat
      val accuracy1 = ok / ppy.length
      val error1 = 1.0 - accuracy1
      println(s"svm2 error = ${error1 * 100}% ; accuracy = ${accuracy1 * 100}%")

When I do the same as above but predict on the test data, I get the error rate reported above.

Input data

Train Data set: svmguide1

Test data set: svmguide1.t

When the data is read in, the labels are -1 and +1. Because SMILE requires labels 0 and 1, we relabel ppy (-1 is mapped to 0, +1 is mapped to 1). These data sets are scaled into the range [-1, +1]; a minimal sketch of the scaling is shown below.
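For completeness, the scaling is roughly like this (a minimal sketch, not the exact code I used; scaleToUnitRange is just an illustrative name):

      // Per-feature min-max scaling into [-1, +1] (sketch only).
      // In practice the test set must be scaled with the mins/maxs computed on the training set.
      def scaleToUnitRange(x: Array[Array[Double]]): Array[Array[Double]] = {
        val d    = x.head.length
        val mins = Array.tabulate(d)(j => x.map(_(j)).min)
        val maxs = Array.tabulate(d)(j => x.map(_(j)).max)
        x.map { row =>
          Array.tabulate(d) { j =>
            val range = maxs(j) - mins(j)
            if (range == 0.0) 0.0 else -1.0 + 2.0 * (row(j) - mins(j)) / range
          }
        }
      }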


hmf commented 5 years ago

Something is definitely wrong here. I made several experiments using either the training or the test data set, as they are or in reverse order, always learning from scratch. I get errors ranging from 97% to 0.3%. The order of the samples should not generate such large variations.

I also get the same results if I use a pre-learned model on a test data set.

haifengl commented 5 years ago

Our learning algorithm is different from libsvm and standard SMO. It is an active learning algorithm and learns the SVM in an online way (i.e. one sample at a time). The training samples should be permuted first for any online learning algorithm.

hmf commented 5 years ago

@haifengl I am aware of this. I am assuming it is the LASVM algorithm. I have read the paper and taken a look at your code and the original code. I see you are using second-order information to select the violating pair, and you also cache your support vectors and their respective kernel values and gradients. As such, I expect the SMILE implementation to be on par with SMO (libSVM), as per the paper's results. However, as I have noted above, for this simple data set the error varies between 97% and 25% simply by reversing the order. All tests used the finalize step. It does seem correct.

Have you compared the performance of your algorithm with the original implementation? I can try running the original on the data sets I used and get back to you on that. Do you have any gold-standard tests I can look at?

TIA

haifengl commented 5 years ago

I don't see that you randomized the order of your training data. Did I miss something?

hmf commented 5 years ago

Apologies for the lack of explanation. The tests that changed the order of the samples were a simple reverse: I used the code above but reversed the training/test data. I provided the code just in case I made some obvious error.

On a side note: the original LASVM code, as per its description, also randomizes the input. Your code also seems to do this. However, I have noticed that it seems to always use the same seed, so its results are deterministic as a function of the sequence of calls, i.e. it produces the same results on the first call. I don't know if this is of any relevance.

haifengl commented 5 years ago

We let the user do the permutation so that they have the flexibility. A reverse is hardly a random permutation. In your data, the samples are sorted by their labels, which is the worst case for LASVM. You can call "smile.math.Math.permutate" for the default implementation.
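For example, something like this (a minimal sketch reusing px/ppy from your snippet above; permutate(n) returns a random permutation of the indices 0 until n):

      // build a random permutation of the sample indices and reorder data and labels with it
      val perm        = smile.math.Math.permutate(px.length)
      val pxShuffled  = perm.map(px(_))
      val ppyShuffled = perm.map(ppy(_))

      svm2.learn(pxShuffled, ppyShuffled)
      svm2.finish()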

hmf commented 5 years ago

I was under the impression you already did that. See: https://github.com/haifengl/smile/blob/1d55b75f77c649d1de595d338e95e3eb1396b74a/core/src/main/java/smile/classification/SVM.java#L497

I am most probably wrong and interpreted the code incorrectly. However, I have already started to write a function to load the original LASVM test files. I have also run the original LASVM code on these files (already randomized). I will do the same for SMILE's SVM and report the results here. It will be interesting to see what we get.

hmf commented 5 years ago

Apologies for the late update, but this is going to take some time (I have other tasks to complete). I ran SMILE's LASVM on the following data:

https://gitlab.com/cese/adw/blob/master/data/lasvm/adult/adult.trn

It is one of the original test data sets of the LASVM algorithm. Unfortunately I get an OOM exception:

[info] SMILE LASVM comparison tests
[info] pt.inescn.model.LASVMSpec *** ABORTED ***
[info]   java.lang.OutOfMemoryError: Java heap space
[info]   at smile.math.DoubleArrayList.ensureCapacity(DoubleArrayList.java:77)
[info]   at smile.math.DoubleArrayList.add(DoubleArrayList.java:117)
[info]   at smile.classification.SVM$LASVM.process(SVM.java:822)
[info]   at smile.classification.SVM$LASVM.process(SVM.java:742)
[info]   at smile.classification.SVM$LASVM.learn(SVM.java:500)
[info]   at smile.classification.SVM$LASVM.learn(SVM.java:438)
[info]   at smile.classification.SVM.learn(SVM.java:1253)
[info]   at smile.classification.SVM.learn(SVM.java:1207)
[info]   at pt.inescn.model.LASVMSpec.$anonfun$new$31(LASVMSpec.scala:741)
[info]   at pt.inescn.model.LASVMSpec$$Lambda$285/1251945635.apply(Unknown Source)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FlatSpecLike$$anon$5.apply(FlatSpecLike.scala:1682)
[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info]   at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685)
[info]   at org.scalatest.FlatSpecLike.invokeWithFixture$1(FlatSpecLike.scala:1680)
[info]   at org.scalatest.FlatSpecLike.$anonfun$runTest$1(FlatSpecLike.scala:1692)
[info]   at org.scalatest.FlatSpecLike$$Lambda$326/846392414.apply(Unknown Source)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
[info]   at org.scalatest.FlatSpecLike.runTest(FlatSpecLike.scala:1692)
[info]   at org.scalatest.FlatSpecLike.runTest$(FlatSpecLike.scala:1674)
[info]   at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685)
[info]   at org.scalatest.FlatSpecLike.$anonfun$runTests$1(FlatSpecLike.scala:1750)
[info]   at org.scalatest.FlatSpecLike$$Lambda$307/524776996.apply(Unknown Source)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
[info]   at org.scalatest.SuperEngine$$Lambda$308/802508698.apply(Unknown Source)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)

The training data set has 123 features and 32562 samples. You can find the test here:

https://gitlab.com/cese/adw/blob/clean_tests/core/src/test/scala/pt/inescn/model/LASVMSpec.scala#L704

Please note that the data is used as-is: no scaling or shuffling. If you need any more details, please ask.
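For reference, my loader parses the libsvm-style sparse lines (label followed by index:value pairs) roughly like this (a sketch, not the exact function I wrote; it assumes 1-based feature indices and a known number of features):

      // parse one libsvm-format line: "label idx:val idx:val ..." into a 0/1 label and a dense feature vector
      def parseLine(line: String, numFeatures: Int): (Int, Array[Double]) = {
        val tokens = line.trim.split("\\s+")
        val label  = if (tokens.head.toDouble <= 0.0) 0 else 1
        val x      = new Array[Double](numFeatures)
        tokens.tail.foreach { t =>
          val Array(i, v) = t.split(":")
          x(i.toInt - 1) = v.toDouble   // indices in the file are 1-based
        }
        (label, x)
      }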

haifengl commented 5 years ago

Then you should increase heap size.
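For example, if the tests run through sbt, something like this in build.sbt should be enough (a sketch; adjust the value to your machine):

      // build.sbt: run the tests in a forked JVM with a larger heap
      Test / fork := true
      Test / javaOptions += "-Xmx15G"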

hmf commented 5 years ago

I got the algorithm to execute with a heap of 15G. The result of the original LASVM is:

la SVM
______
loading "adult.trn"..  
examples: 32562   features: 122
set cache size 256
..0..100..200..300..400..500..600..700..800..900..1000 [ ...progress counters elided... ] ..32100..32200..32300..32400..32500..[finishing]
nSVs=11262
||w||^2=1.05946e+06
kcalcs=399911154

la test
_______
loading model: 11262 svs
loading "adult.tst"..  
examples: 16282   features: 121
accuracy= 85.0817 (13853/16282)

See https://gitlab.com/cese/adw/blob/clean_tests/data/lasvm/adult

The result of SMILE's LASVM is:

[pool-1-thread-1-ScalaTest-running-LASVMSpec] INFO smile.classification.SVM - SVM finished the reprocess.
[pool-1-thread-1-ScalaTest-running-LASVMSpec] INFO smile.classification.SVM - 31330 support vectors, 4483 bounded

svm2.getSupportVectors.size() = 31330
Predict learn 1
Counted learn 1
svm2 error = 4.400837421417236% ; accuracy = 95.59916%
Predicted read and check test data
svm2.getSupportVectors.size() = 31330
Predicted test 1
Counted test 1
svm2 error = 60.83471477031708% ; accuracy = 39.165287%

As you can see, it seems like SMILE's LASVM is over-fitting on the training data set, with about three times as many SVs as the original algorithm (which uses 11262 SVs). You can also see that the test accuracy is quite low. This seems to replicate the results I had on the smaller data sets.

Maybe I am doing something wrong?

haifengl commented 4 years ago

I tested on the svmguide1 data. I don't scale the data to [0, 1] and thus use Gaussian(90). With C=1, I get an accuracy of 96%. I have not tuned the hyperparameters yet. What's your C value?
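In code, that setting is simply (a minimal sketch using the same API as your snippet above; px/ppy stand for the unscaled data and 0/1 labels):

      // unscaled svmguide1: wider Gaussian kernel, C = 1
      val kernel = new GaussianKernel(90.0)
      val svm    = new SVM(kernel, 1.0)
      svm.learn(px, ppy)
      svm.finish()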

haifengl commented 4 years ago

For the adult data, the features are binary sparse, so I encode each sample as an int[] of indices and use BinarySparseGaussianKernel(31). The model can be trained with the default heap size, and we get an accuracy of 82.12% with only 1026 support vectors. I guess that you encode it differently?
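Roughly like this (a sketch, not my exact code; `rows`, `nonZeroIndices` and `labels` are placeholders for however you load the libsvm-format samples, and C = 1 is just an example value):

      import smile.classification.SVM
      import smile.math.kernel.BinarySparseGaussianKernel

      // each sample becomes the sorted indices of its nonzero (binary) features;
      // nonZeroIndices is a placeholder for extracting those indices from a parsed row
      val x: Array[Array[Int]] = rows.map(r => r.nonZeroIndices.sorted.toArray)
      val y: Array[Int]        = labels   // 0/1 labels

      val kernel = new BinarySparseGaussianKernel(31.0)
      val svm    = new SVM[Array[Int]](kernel, 1.0)
      svm.learn(x, y)
      svm.finish()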

haifengl commented 4 years ago

I found and fixed the root cause of why it is sensitive to the sample order. It is in the v2 branch now. Thanks a lot!

hmf commented 4 years ago

@haifengl Apologies for replying so late, but I am swamped with work. In regards to the svmguide1 data, I had no problem.

For the adult data sets, I have these in text format, and the same goes for the other test data sets. These are the original data sets used to test LASVM. The results are in the run.out file; adult.trn.model also contains them. So the accuracy was "85.08" and the number of support vectors was 11262, which means your results are ok. For more test results, please see the other data sets.

Right now I don't have time to test on the other data. Sorry.

Thanks for taking the time to check this.

haifengl commented 4 years ago

After I redesigned the kernel cache, the accuracies are 96.73% on svmguide1 and 84.95% on the adult data. They are on par with the libsvm and lasvm packages, and the training is no longer sensitive to the order of the samples. We can close this ticket now. Thanks again for reporting this issue!