mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

ensemble SVM project #288

Closed aydindemircioglu closed 8 years ago

hetong007 commented 9 years ago

This week (May 3rd - May 9th), I am going to study the details of the first and third papers. Next week, I will study the details of the second paper and the corresponding MATLAB code.

This time I will figure out the answers to the following questions through the study:

Following are some general questions:

aydindemircioglu commented 9 years ago

while assembling the necessary tools like kmeans, please use mlr functions if possible; so instead of using e1071/kernlab for the svm, use the corresponding mlr function. only if there is no mlr equivalent does it make sense to use an external package. (in that case please tell us, probably we should add an mlr wrapper for that package).

primarily we will try to reproduce the results of the papers, so the benchmark data sets are pretty much fixed. if i remember correctly, all data sets in the papers are free to download. while developing you can also simply subsample the larger data sets, if needed. (openml also has many data sets, see below)

i do not know if mlr keeps a record of the results (@bernd: do we?), but if you have time at the end, the results can be uploaded to http://openml.org/. bernd already has an R wrapper, so interfacing with openml should be quite easy.

berndbischl commented 9 years ago

i do not know if mlr keeps a record of the results

No, we would use openml for that

@hetong007 Please put the data sets you use on openml if they do not exist there yet and there is no problem doing that.

hetong007 commented 9 years ago

I studied two papers. Here is a summary of each, followed by two questions.

The first paper: Clustered Support Vector Machines

Training :

  1. Cluster the data with stats::kmeans.
  2. Transform the data according to the Eq. (4) - Eq. (7) in the original paper.
  3. Solve the new problem with a linear svm from LiblineaR::LiblineaR.

Prediction :

  1. Transform the test data according to the Eq. (4) - Eq. (7).
  2. Predict with the trained model.
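
For my own notes, here is a minimal R sketch of the two procedures above. It assumes the Eq. (4) - Eq. (7) transformation amounts to a shared feature block scaled by 1/sqrt(lambda) plus one cluster-specific copy of the features per cluster; the function names and the lambda handling are placeholders, not a final interface.

library(LiblineaR)

# hypothetical helper: global block scaled by 1/sqrt(lambda), plus one
# cluster-specific copy of the features for the cluster each row belongs to
transformFeatures = function(x, cluster, k, lambda) {
  n = nrow(x); p = ncol(x)
  xt = matrix(0, n, p * (k + 1))
  xt[, 1:p] = x / sqrt(lambda)
  for (i in seq_len(n)) {
    offset = p * cluster[i]
    xt[i, (offset + 1):(offset + p)] = x[i, ]
  }
  xt
}

clusterSVMTrain = function(x, y, k = 8, lambda = 1) {
  km = kmeans(x, centers = k)
  xt = transformFeatures(x, km$cluster, k, lambda)
  model = LiblineaR(data = xt, target = y, type = 2)  # L2-regularized L2-loss SVC
  list(model = model, centers = km$centers, k = k, lambda = lambda)
}

clusterSVMPredict = function(fit, newx) {
  # assign each test point to the nearest training centroid (euclidean distance)
  d = sapply(seq_len(fit$k), function(j) rowSums(sweep(newx, 2, fit$centers[j, ])^2))
  xt = transformFeatures(newx, max.col(-d), fit$k, fit$lambda)
  predict(fit$model, xt)$predictions
}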

Data used :

Possible Extension :

The clustering process is completely separate from the SVMs, so I think the clustering algorithm doesn't have to be kmeans. I prefer to leave it as a parameter and set the default option to kmeans.

The third paper: Mixture of SVM Experts

Training :

  1. Randomly divide the data set into subsets.
  2. Train one svm on each subset with e1071::svm.
  3. Train a "gater" on the results of these svms. For the neural network mentioned in the paper, we probably need to implement our own version. Because GitHub doesn't render LaTeX formulas, I put my reasoning here: http://mathb.in/35456.
  4. Reassign the data examples according to the $w_m(x)$ values from the "gater".
  5. Loop over steps 2-4 until the stopping criterion is met.

Prediction

  1. Predict with all the svms on the full test set.
  2. Put the result through the "gater" to get the final result.
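
To check my reading of step 2, here is a hedged sketch of the gated combination for binary labels in {-1, +1}. It assumes experts is a list of e1071::svm models and that gaterWeights(newx) returns an n x M matrix of the $w_m(x)$ values; the gater itself is the part still under discussion.

library(e1071)

moePredict = function(experts, gaterWeights, newx) {
  # decision values of every expert on the full test set (n x M matrix)
  f = sapply(experts, function(m)
    as.numeric(attr(predict(m, newx, decision.values = TRUE), "decision.values")))
  w = gaterWeights(newx)   # n x M gating weights w_m(x), rows summing to 1
  score = rowSums(w * f)   # weighted mixture of the expert outputs
  ifelse(score > 0, 1, -1)
}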

Data used

Preprocessing is clearly described in the paper.

Questions

On the algorithm

I hope I understand the training process of the third paper correctly, but I still have a question about the algorithm:

On the implementation

aydindemircioglu commented 9 years ago

about your questions: after reviewing the algorithm, it does seem to be true that you need to retrain all the svms and all the neural networks in each step. you can still try warm starts: keep the alpha values of the data points for the svms and the weights for the neural network. this might not help much, but it's worth a try.

i will need to rethink the question of whether we need to redefine backprop to implement this. it probably also depends on the NN package you want to use. have you already selected one? do you want to go with the nnet package integrated in mlr? at least some packages (i'm thinking of shark, the library we use at my institute) offer the possibility to define a custom loss function, and theano and the like can automatically differentiate loss functions. that might help. but as i said, i need to rethink that.

finally, about the benchmarks: personally i think yes, this must be included, but two points: usually one can 'boil down' the example data in the test, so that the whole package gets tested quite quickly. second, mlr already seems to distinguish between 'expensive' tests and normal tests, so it should be no big problem to put it there (bernd has to decide ;)). if there is no place, we also have the option to create our own package for the ensemble svm stuff and only link that package to mlr, so mlr will only do black-box testing and no benchmark tests.

berndbischl commented 9 years ago

Regarding the benchmarks: Please put the code into a separate place. I can also give a lot of hints, help, code etc for this. We should have a hangout discussion about this.

berndbischl commented 9 years ago

To clarify: What should go into mlr, at some point, is the code and technical tests that show its validity. To really benchmark it, or certain aspects of it, is at least IMHO conceptually a different thing, although still part of your GSOC project :) So I would simply have a different repo for the scientific study that goes along with your implemented methods.

Does this make sense?

aydindemircioglu commented 9 years ago

yes, it does make sense.

but then again i fear a bit that such benchmarks might be lost over time if they are placed in a separate study project. can we put the benchmarks into demo/example applications instead?

and just out of curiosity, is a separate package an option? this way we could have all the tests and benchmarks there, and mlr would only test the integration/validation of that package, just like you do for the other external packages.

a hangout is a very good idea, i would join and would ask hanna to join as well; she works on the svmbridge package, where we need some tests too ;)

berndbischl commented 9 years ago

Let's discuss this orally; it is much easier for me to explain the constraints that way.

hetong007 commented 9 years ago

Thanks for your comments!

@aydindemircioglu: In the third paper, there's an experiment on the number of iterations, and we can see that it helps. But I admit I don't get the intuition for why having a loop improves accuracy. If the assignment of the data is the key point, then it is like a clustering algorithm that uses the weight information from the gater. To me, what makes more sense is a boosting framework in which each iteration improves the svms and the gater based on the current error, rather than discarding everything trained in the last round.

On the neural network choice, I don't know what the current best standalone package in R is, maybe nnet? H2O is very flexible and widely used on Kaggle, but it requires its own backend libraries. If there's no library that supports a customized loss function, I guess we need to re-implement it.

@berndbischl For the hangout, I am available starting from 8 a.m. Vancouver time, which is 5 p.m. in Germany. If you are all available at/before 10 a.m. in Germany (which is at/before 1 a.m. in Vancouver), I can also join you for a one-hour meeting.

berndbischl commented 9 years ago

Hi, we need to plan these meetings a little bit ahead of time, and please by email. That said: I am available tonight from 22:00 to 1:00 CET.

aydindemircioglu commented 9 years ago

maybe we should wait with the hangout and have a big meeting again, probably after zach's return?

aydindemircioglu commented 9 years ago

yes, it is not intuitive, you are right, and i did not explicitly think about that. on the other hand, it is not completely weird: after all, though you 'throw away' the trained neural network and the svm, you keep the clustering. and in the end that may be the most important thing, as you can see in the DCSVM paper, too. furthermore, just to put the paper into perspective: there is a paper from 2004, http://www.ncbi.nlm.nih.gov/pubmed/15165393, that shows that 'primitive' random trees outperform this scheme. then again, collobert in his phd thesis (and in a follow-up paper, see http://www.worldscientific.com/doi/abs/10.1142/S0218001403002411) tried to replace the global gater with a local one, hoping to get a faster and better algorithm, but he concludes in his thesis "The generalization performance of a hard mixture with a local gater were in all cases worst than what we obtained with a global gater, especially with SVM experts". so.... ;)

the reason i chose that approach nonetheless is that it is iterative, something most other ensembles do not use, so maybe one can use that somewhere else. i am also a bit curious what the final clustering of the approach looks like: will it look nice like kmeans, or will it be weird? and as you said, it should be possible to improve on this, and you are free to do so later on. (also, in case you prefer some other approaches/papers, feel free to propose one.)

on the nn package, i have no clue. i'd ideally prefer one which has support for deep nets, but it seems that there is no such package yet :/ (i wish we had a gsoc project on that).

hetong007 commented 9 years ago

I just got back to Vancouver and had a good sleep to get rid of the jet lag. This week (starting from today) I am planning to implement the first algorithm since it is the easiest one. Hopefully I will make several commits, each of which represents a complete step of the algorithm.

The Second Paper: Divide and Conquer SVM

In my opinion this paper goes one step further than the first one. The basic structure of the algorithm is divide and conquer: with a fixed number of levels, it can use the results from the sub-levels to speed up solving the full-size problem.

Training :

  1. Sample points from the support vectors (or the whole data set on the bottom level) to run kernel kmeans. The kernel kmeans can be done with kernlab::kkmeans.
  2. For each cluster, use the corresponding weights from the sub-level to refine the weights. Here we use the coordinate descent method to solve it. As far as I know, e1071::svm doesn't have a "continue training" interface that lets us initialize the weights, so that part is left for me to implement (see the rough sketch after this list).
  3. Refine weights by solving SVM on all the support vectors using the current weights.
  4. Solve SVM on the whole data using the refined weights as the initial points.
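
Here is a rough single-level sketch of steps 1 and 2 under my current reading. The kernel-space assignment is approximated by a euclidean assignment to the kkmeans centers, and the warm start is only indicated in a comment because e1071 exposes no such interface yet.

library(kernlab)
library(e1071)

dcsvmLevel = function(x, y, k = 4, sample.frac = 0.1, ...) {
  # step 1: run kernel kmeans on a subsample only, to keep it tractable
  idx = sample(nrow(x), ceiling(sample.frac * nrow(x)))
  kk = kkmeans(as.matrix(x[idx, , drop = FALSE]), centers = k)
  cen = centers(kk)
  # assign every point to its nearest center (euclidean approximation)
  d = sapply(seq_len(k), function(j) rowSums(sweep(x, 2, cen[j, ])^2))
  cl = max.col(-d)
  # step 2: train one SVM per cluster; their dual coefficients would seed
  # the solve on the next level once a warm-start interface is available
  models = lapply(seq_len(k), function(j)
    svm(x[cl == j, , drop = FALSE], factor(y[cl == j]), kernel = "radial", ...))
  alphas = lapply(models, function(m) m$coefs)
  list(cluster = cl, models = models, alphas = alphas)
}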

Prediction :

Data used

  1. ijcnn1
  2. cifar
  3. census
  4. webspam
  5. covtype
  6. kddcup99
  7. mnist8m

Questions

  1. Apologies, I forgot whether we reached an agreement on the benchmark code. Where should I keep the code that reproduces the results from the experiments?
  2. Is there any SVM implementation in R that can be initialized by the weights?
aydindemircioglu commented 9 years ago

nice to have you back :)

about the csvm, you might also want to take a look at what bing he did-- she has her code here https://github.com/HeBing/mlr_GSoC2015. might save you a little bit of time.

about the alpha initialization, i can hack that for you into either e1071 or via our svmbridge (that is currently under construction; i would prefer the latter, but in the end it's the same modification to libsvm). i can only do that by the end of this week, please ping me again then. i also wonder what is best for DCSVM; the first paper is a little bit vague on using kernel kmeans versus plain kmeans. you should make that choosable-- see the 'follow-up' paper, Fast Prediction for Large-Scale Kernel Machines, and the corresponding code. i will also look over that next weekend.

finally, as said, create your very own git repository and do all your work there. as discussed, it is also ok for us to create our own ensemble-svm library and just put 'links' into mlr (just as mlr uses e1071/kernlab for svm computations), plus some black-box unit testing to check that calling our lib works. then you can do all your benchmarking and other tests in this library.

and unluckily i did not have time to look into the mixture-of-experts thing; hopefully soon, but it's at least two weeks from now.

hetong007 commented 9 years ago

Oh I am glad to know that Bing is working as well. I hope she's doing well.

I like the idea of having an ensemble-svm library! I am still getting familiar with mlr, so having a standalone repository means I don't have to code everything in the style of an mlr learner, which makes things easier for me. And with more degrees of freedom, I can now hack e1071::svm as well.

I will try to use xgboost to look into the mixture-of-expert thing, because one can specify the objective with it.

aydindemircioglu commented 9 years ago

no, bing he is not part of gsoc anymore; she had coded that just for fun in the proposal phase. that's really a pity, as she wanted to take a deeper look into the design of a general ensemble-(svm) library. i hope she's well, too.

hetong007 commented 9 years ago

That's bad news :(

Today I created a repository for ensemble svm: https://github.com/hetong007/EnsembleSVM. I just wrote a simple CSVM model and a test. The test works, but it seems that e1071::svm is very slow when the dimension is high, even with a linear kernel. I cannot get anywhere near a training time of 0.32 secs on the svmguide1 data set; the test command needs 200 secs on a decent CPU.

aydindemircioglu commented 9 years ago

that's strange, as e1071::svm uses libsvm. probably this is because of the sparse data format, which is awfully slow. probably our svmbridge package is faster; i need to try that out later this week.

hetong007 commented 9 years ago

I compared the sparse and non-sparse versions of the transformed x, but both are slow (more than 150 secs). I looked into e1071:::svm.default and figured out that the bottleneck is the .C call.

The AUC is good: above 0.97 on the test set.

aydindemircioglu commented 9 years ago

sorry, probably i misunderstood you: when you said "e1071::svm is very slow", i thought you had compared it directly with (command-line) libsvm? (if not, it's a good idea to do that.) in general libsvm is quite a slow solver, so if that was your observation, everything is ok. but if you see a large gap between libsvm and e1071::svm, that would surprise me, and we would need to look deeper into why this is the case.

hetong007 commented 9 years ago

Sorry, I was not reading the paper carefully enough. They used LIBLINEAR, not LIBSVM. Therefore I re-tested with the R package LiblineaR, and the speed is greatly improved: currently around 2.5 secs. The accuracy is above 90%, which is actually better than the result in the paper.

I will try to reproduce some other experiments.

berndbischl commented 9 years ago

Ok, that LIBLINEAR is a lot faster is to be expected. Tong, you should update your algorithm description to state whether you always want to use the linear kernel for fitting the SVM.

hetong007 commented 9 years ago

Thanks for the reminder. I think one of the key points for this algorithm is the speed, so I will stick to the LiblineaR package.

berndbischl commented 9 years ago

I think one of the key points for this algorithm is the speed, so I will stick to the LiblineaR package.

Sure. But we also need an "appropriate" SVM model. Due to the previous clustering and then local modelling, we expect to need only a linear SVM in the clusters, right? Are we reasonably sure that we never want an RBF-SVM for the local clusters? Because if so, you can proceed as outlined.

hetong007 commented 9 years ago

The argument made in the paper is that the "clustering-linear" model is substantially faster while the experimental results are as good as (if not better than) the kernel SVM. So far the observations in my tests support this.

The "clustering-kernel" model is similar to the divide and conquer SVM with only one layer. I think if one wants to use the kernel SVM, the DCSVM is a better choice.

hetong007 commented 9 years ago

Today I turned the repository into an installable R package (via devtools) with a runnable test that reproduces the experiments in the paper: https://github.com/hetong007/EnsembleSVM/blob/master/tests/test_cluster_SVM.R.

Efficiency

Currently the efficiency bottleneck is the kmeans step. For the largest MNIST OvsE data set (60000 x 782):

The running time in the paper is less than 40 secs.

Accuracy

LiblineaR can only predict the label, not the probability. With the current implementation, the score is at about the same level as in the paper.

berndbischl commented 9 years ago

OK, I agree that we need a faster clustering algo. But regarding the time comparison: you use different hardware than in the paper, right? How do you think you would be able to compare absolute numbers to the paper? Do you at least have an impression of how much faster / slower your implementation is?

aydindemircioglu commented 9 years ago

mmh, i think the kmeans from stats is somehow not doing what i expect. if you use, say, iter.max = 5, then i'd expect kmeans to be quite fast, but calling kmeans with iter.max = 5 does not seem to change much, e.g.

microbenchmark(kmeans(x, iter.max = 5, nstart = 1, centers = 8), times = 1)
Unit: seconds
                                              expr      min       lq     mean   median       uq      max neval
  kmeans(x, iter.max = 5, nstart = 1, centers = 8) 36.60263 36.60263 36.60263 36.60263 36.60263 36.60263     1
Warning message: did not converge in 5 iterations

microbenchmark(kmeans(x, iter.max = 1000, nstart = 1, centers = 8), times = 1)
Unit: seconds
                                                 expr     min      lq    mean  median      uq     max neval
  kmeans(x, iter.max = 1000, nstart = 1, centers = 8) 44.9682 44.9682 44.9682 44.9682 44.9682 44.9682     1

maybe kmeans first does some preprocessing step that takes nearly all the time? can you take a look into the kmeans code to see why the difference is so small for such different iter.max values? at least when using Lloyd's algorithm the difference is much more pronounced:

microbenchmark(kmeans(x, iter.max = 1000, nstart = 1, centers = 8, algorithm = "Lloyd"), times = 1)
Unit: seconds
                                                                      expr      min       lq     mean   median       uq      max neval
  kmeans(x, iter.max = 1000, nstart = 1, centers = 8, algorithm = "Lloyd") 99.35879 99.35879 99.35879 99.35879 99.35879 99.35879     1

microbenchmark(kmeans(x, iter.max = 10, nstart = 1, centers = 8, algorithm = "Lloyd"), times = 1)
Unit: seconds
                                                                    expr      min       lq     mean   median       uq      max neval
  kmeans(x, iter.max = 10, nstart = 1, centers = 8, algorithm = "Lloyd") 11.87571 11.87571 11.87571 11.87571 11.87571 11.87571     1
Warning message: did not converge in 10 iterations

maybe using Lloyd's algorithm with around 150 iterations is the way to go.

hetong007 commented 9 years ago

@berndbischl Compared to the original paper, our running time is at the same level except for the largest data set (the one I listed previously). I thought it would be better with a kmeans implementation optimized for sparse matrices. However, I tested with

# wrap skmeans so its result exposes $centers, as clusterSVM expects
sk.Fun = function(...) {
    result = skmeans(...)
    result$centers = result$prototypes
    result
}
csvm.obj = clusterSVM(x = mnistoe[,-1], y = mnistoe[,1], cluster.FUN = sk.Fun,
                      k = 8, seed = 512, method = 'pclust', control = list(maxiter = 150),
                      valid.x = mnistoe.t[,-1], valid.y = mnistoe.t[,1])

As the number of iterations increases, it gets even worse.

@aydindemircioglu The major time consumption in kmeans is the call to the C/Fortran core functions. There seems to be no specific preprocessing before that, although the default 'Hartigan-Wong' algorithm itself does have some preprocessing steps. I tested the different algorithms; here is the table of running times (in seconds):

Algorithm          iter.max = 5   iter.max = 150
Hartigan-Wong      47.186         61.632
Lloyd              10.443         167.407
Forgy              33.979         171.264
MacQueen           11.619         77.346
skmeans::pclust    12.062         206.351

Another observation: the three algorithms in the middle converged to the same result with the same random seed.

aydindemircioglu commented 9 years ago

ok, that does not really look that good. if you feel like it, you could try variations of kmeans like kmeans++ (see the flexclust package), or alternatively approximate kmeans just like in the DCSVM paper, i.e. take a subset (say 10%) of your data, apply kmeans to that, and then assign to every data point the label of its nearest centroid (a small sketch of this follows below).
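
a quick, hedged sketch of what i mean (the name is just illustrative); it returns result$cluster and result$centers, so it should slot into your cluster.FUN directly:

approxKmeans = function(x, centers, frac = 0.1, ...) {
  # cluster only a subsample, then label everything by its nearest centroid
  idx = sample(nrow(x), ceiling(frac * nrow(x)))
  km = kmeans(x[idx, , drop = FALSE], centers = centers, ...)
  d = sapply(seq_len(nrow(km$centers)), function(j)
    rowSums(sweep(x, 2, km$centers[j, ])^2))
  list(cluster = max.col(-d), centers = km$centers)
}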

nonetheless, we should not waste too much time on why the kmeans is not as fast as in the paper. if we have roughly the same level of accuracy, we should first wrap things up in a usable state and come back to the kmeans problem later. so maybe the next step is to spend some time making the package more 'round' and trying to link it to mlr?

hetong007 commented 9 years ago

I agree. May I know your specific opinions on how to make the package more 'round'?

Between now and next weekend, I will also try to make a learner in mlr as the link to this function (a rough draft of the hooks is sketched below). For the unit test of this new learner, I will directly use the existing data sets in mlr to keep it 'light'.
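
Here is a rough draft of what I think the learner hooks could look like, assuming the clusterSVM() interface from my repository; the parameter set is incomplete, and the $predictions field of the predict result is an assumption on my side:

library(mlr)

makeRLearner.classif.clusterSVM = function() {
  makeRLearnerClassif(
    cl = "classif.clusterSVM",
    package = "EnsembleSVM",
    par.set = makeParamSet(
      makeIntegerLearnerParam(id = "k", default = 8L, lower = 2L)
    ),
    properties = c("twoclass", "numerics")
  )
}

trainLearner.classif.clusterSVM = function(.learner, .task, .subset, .weights = NULL, ...) {
  d = getTaskData(.task, .subset, target.extra = TRUE)
  EnsembleSVM::clusterSVM(x = as.matrix(d$data), y = d$target, ...)
}

predictLearner.classif.clusterSVM = function(.learner, .model, .newdata, ...) {
  # assumes the package's predict method returns a list with $predictions
  predict(.model$learner.model, as.matrix(.newdata), ...)$predictions
}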

aydindemircioglu commented 9 years ago

basically you have two options: by the end of the project, move all the code directly into mlr; or leave it in its own package and use it from mlr like other packages, say e1071. for the latter option you need to upload your package to cran, which means you need all the usual things like vignettes, manuals etc. (though you'd also need these if you put your code into mlr, and either way you need some documentation in mlr as well). so in case you want to go the cran package route, i personally would start by initializing all the things you need, like the vignettes directory.

hetong007 commented 9 years ago

Thank you. I think the second option is more feasible. Sometimes users might find mlr a bit 'heavy' when they just want to try a single function of it. With that in mind, I will soon build a skeleton that can pass R CMD check --as-cran. And of course I will remove the huge data files before submission, but for now I would like to leave them there for convenience.

aydindemircioglu commented 9 years ago

did you have time to look into the kmeans problem? i just found this: http://stackoverflow.com/questions/21177074/why-is-kmeans-so-slow-on-high-spec-ubuntu-machine-but-not-windows -- could be a severe performance bug :/ can you, just for testing, try out the MLPACK kmeans? as a demo function it takes no options other than the number of iterations, but we will get a rough feeling for whether it helps. clustering the data set from the above question i get the result in roughly a second, while the R kmeans takes 50s. alternatively we could look into finding other kmeans implementations or wrapping a nice kmeans c++ library as a side project. i guess there are many nice c++ libraries out there; i just saw http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/yakmo/ and whatever they do, maybe it's faster-- but the code is so ugly and somehow c-like, i do not feel particularly inclined to wrap it (the weather is finally good here; on a rainy cold day i'd probably do that).

aydindemircioglu commented 9 years ago

i just did a small test on mnist; it seems that yakmo takes roughly 5.5s while MLPACK needs 9s on my machine (i did NOT check the quality, they all claim to have converged). so it might be worthwhile to have a yakmoR package. i will probably give it a shot on some rainy day over the next week(s).

hetong007 commented 9 years ago

I observed the same problem on my CentOS machine.

I used the RcppMLPACK package and wrote a kmeans function according to its vignette. The running time on MNIST is roughly 20 secs. Compared to the 60 secs of stats::kmeans, that is a great improvement. One possible problem with this package is that it requires Rcpp, RcppArmadillo and a not-too-old gcc compiler. The default compiler on my machine is gcc 4.4.7, which failed to compile it; I succeeded after switching to 4.9.2.

If we set it as the default clustering algorithm, I am not sure how it will affect the user experience.

aydindemircioglu commented 9 years ago

yeah, this sounds suboptimal, though (in my opinion) anything below gcc 4.7 is pretty much outdated. at least it is good to hear that kmeans is now closer to what we expect. i have now started on that yakmoR package; i can already return the centroids to R, but there are still a lot of things to do. what are you up to currently? how is the integration into mlr going? have you already started on DCSVM and the modifications you need to pass in the 'warm-start' alphas?

hetong007 commented 9 years ago

Thank you for this yakmoR package! It sounds promising.

I am currently working on the mlr integration. I have a function-type argument, cluster.FUN, which takes a customized clustering function. Because different algorithms return results in different formats, it currently requires users to define their own wrapper functions whose results contain result$cluster and result$centers. However, I think this is not very mlr-like. So I have a question about it:

aydindemircioglu commented 9 years ago

very promising... i just spent a couple of hours looking for a bug in my code that does not seem to be there. in other words, predicting the labels 'while' training yields different cluster results than training the model first and then predicting with a second call. seems that yakmo has some kind of bug. :/ nonetheless, i should now be able to wrap that code up and give it a try in your code.

unluckily i am not the expert on mlr integration, so @berndbischl can probably help out? from my (beginner's) viewpoint i'd code the package in such a way that kmeans clustering works properly without extra wrapping; every other clustering has to obey that signature (are there any other default clustering algos in R?). as for mlr, one would then write a wrapper that transforms the mlr kind of clustering into one like kmeans, calls clusterSVM, and converts the results back. so i do not see any way to get all this into one single argument. as said, bernd will for sure have a better answer.

aydindemircioglu commented 9 years ago

a short note: your predict function computes the labels by itself. shouldn't it instead rely on the clustering function, since some (like kernel k-means) might use a different distance function?

hetong007 commented 9 years ago

For now my code uses stats::kmeans by default; its result has result$cluster and result$centers, so it doesn't need a wrapper.

The distance is a good point I overlooked! In the predict function I still transform the new data based on the cluster centers from the training data, but using the Euclidean distance. I think I can store the clustering function in the training result and call it in the predict function.

Functions like stats::kmeans can assign clusters based on the trained centers, but for customized clustering functions I am afraid that asks for more. This leads back to the integration problem: if we offer some good-enough clustering functions, users won't have to bother implementing/wrapping other, better algorithms. A rough sketch of the fallback idea is below.
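
Something like this is what I have in mind (field names are placeholders, not the current code): if the stored clustering object provides its own assignment rule, use it; otherwise fall back to the Euclidean nearest-center rule.

assignCluster = function(fit, newx) {
  if (!is.null(fit$cluster.predict)) {
    # algorithm-specific assignment, e.g. in kernel space
    fit$cluster.predict(fit$cluster.object, newx)
  } else {
    # Euclidean fallback based on the stored training centers
    d = sapply(seq_len(nrow(fit$centers)), function(j)
      rowSums(sweep(newx, 2, fit$centers[j, ])^2))
    max.col(-d)
  }
}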

berndbischl commented 9 years ago

Should we schedule a meeting so I can answer your questions? I guess that is easier...

aydindemircioglu commented 9 years ago

yes, might be a good idea, is 22h CET ok for both of you?

berndbischl commented 9 years ago

today and tomorrow should work at that time for me

hetong007 commented 9 years ago

This time works for me today.

berndbischl commented 9 years ago

ok, will meet in 3 hours then

hetong007 commented 9 years ago

Thank you for the input! Here's the summary for today's meeting.

aydindemircioglu commented 9 years ago

could you add the problem with the predictions to your list, i.e. that you cannot just use euclidean metrics? is the Diagonal really needed in rowl2norm, or wouldn't a plain multiplication do? and can you also start to document the utility functions? it would possibly also be a good idea to apply the 'bias augmentation' trick to allow for computations with a bias term. i wonder if that changes anything significantly. thanks! :)

hetong007 commented 9 years ago

The Diagonal trick is slightly faster for sparse matrices. Experiments on the largest data set, MNIST, also confirm that.
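
For reference, a rough version of the comparison I ran (the real rowl2norm lives in the EnsembleSVM repo; the matrix size and density here are only chosen to mimic the MNIST case):

library(Matrix)
library(microbenchmark)

x = rsparsematrix(60000, 782, density = 0.05)
inv.norm = 1 / sqrt(rowSums(x^2))

microbenchmark(
  diagonal = Diagonal(x = inv.norm) %*% x,  # the Diagonal trick
  direct   = x * inv.norm,                  # plain elementwise scaling
  times = 5
)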