mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

ensemble SVM project #288

Closed aydindemircioglu closed 9 years ago

aydindemircioglu commented 9 years ago

interesting, at least for dense vectors it seems to be much slower, roughly 50x for vectors of size a million.

hetong007 commented 9 years ago

Hello Aydin and Bernd,

I am back to work on GSoC now. Things I did so far:

I added a simple test file and a script in inst/benchmark to reproduce the result. In the benchmark script I also tuned the parameters. I didn't email the author because I observed a 99% accuracy rate on the ijcnn1 data set, while the corresponding number in the paper is 94%. The tuning process takes time; I hope to have all the results by tomorrow.

Besides, there are several issues on which I need suggestions:

Regarding the clustering function:

Regarding the DCSVM:

berndbischl commented 9 years ago

There's no argument to mute the output message from RcppMLPACK::mlKmeans.

Have you tried suppressMessages / capture.output for a simple external solution? Or BBmisc::suppressAll?
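A minimal sketch of the base-R variants, assuming a hypothetical `noisy.fun` that stands in for the clustering call (it is not the real mlKmeans). It prints via `cat` and also emits a `message`, and the two need different muting. Whether mlKmeans writes through R's output stream at all is an assumption here; output printed directly from C++ would not be caught by `capture.output`.

```r
# Hypothetical stand-in for a chatty function; not the real mlKmeans.
noisy.fun <- function(x) {
  cat("iteration 1 ...\n")   # goes to stdout
  message("converged")       # goes to the message stream
  sum(x)
}

# capture.output() swallows stdout, suppressMessages() swallows messages.
# The helper name `quiet` is illustrative only.
quiet <- function(expr) {
  value <- NULL
  invisible(capture.output(value <- suppressMessages(expr)))
  value
}

res <- quiet(noisy.fun(1:10))
```

The return value is passed through unchanged, so the wrapper can be dropped around an existing call without touching the rest of the training code.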

hetong007 commented 9 years ago

BBmisc::suppressAll works, thank you!

berndbischl commented 9 years ago

RcppMLPACK::mlKmeans: It cannot predict the result given a list of centers.

Is there no returned model and a way to apply it to new data?

hetong007 commented 9 years ago

I don't know how it is for MLPACK, but RcppMLPACK::mlKmeans only takes the number of clusters and only returns the clustering labels. It seems it is not a well-packaged function, and its only advantage is speed.

berndbischl commented 9 years ago

But given the centers, it is not so hard to assign new points to the nearest center yourself?

hetong007 commented 9 years ago

Not hard if I know the distance metric. The best thing is the clustering function can also predict for new data, so that I can save this function to the training output, and then I can call it in predict.

berndbischl commented 9 years ago

I don't get the second sentence.

hetong007 commented 9 years ago

I can have result$cluster.fun in the training result, then in predict.clusterSVM I can call this function to predict the label for new data points.

I just looked into the code. The default metric seems to be squared euclidean distance. There are other options in MLPACK, but they are not exposed in the R function. So I think I can just use the euclidean distance.

aydindemircioglu commented 9 years ago

i still do not quite get it. what i expect: each clustering comes with a training and a predict function. the training function will produce some object (e.g. containing the centroids) that can be passed to its predict function, which will then assign the nearest cluster for a new dataset.

cluster.SVM should now work similarly: the train function needs a trainfun = cluster.train, and needs to save the object the cluster training will produce. then the predict function will take the saved object and call the cluster.predict function with this object.

i hope this is not too messed up explanationwise. but all the euclidean distances etc are wrapped in some predict function and clusterSVM.predict only calls this one.

if MLpack works with euclidean distances (i.e. you call it that way), then you can use euclidean distances in the cluster predict function as well.
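The train/predict pairing described above can be sketched as follows. This is only an illustration under assumptions: the names `cluster.train` and `cluster.predict` are hypothetical, and `stats::kmeans` is used in place of RcppMLPACK so the example runs with base R alone.

```r
# Hypothetical names: cluster.train / cluster.predict are illustrative only.
cluster.train <- function(x, centers, ...) {
  fit <- stats::kmeans(x, centers = centers, ...)
  # Keep what predict needs: the centroids (plus the training labels).
  list(centers = fit$centers, cluster = fit$cluster)
}

# Assign each row of newdata to the nearest centroid
# (squared Euclidean distance).
cluster.predict <- function(object, newdata) {
  newdata <- as.matrix(newdata)
  d <- sapply(seq_len(nrow(object$centers)), function(k) {
    rowSums(sweep(newdata, 2, object$centers[k, ])^2)
  })
  d <- matrix(d, nrow = nrow(newdata))  # keep matrix shape for a single row
  max.col(-d, ties.method = "first")    # index of the smallest distance
}

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.1), ncol = 2))
model <- cluster.train(x, centers = 2, nstart = 5)
new.labels <- cluster.predict(model, rbind(c(0, 0), c(5, 5)))
```

A clusterSVM-style train function would store `model` in its result and its predict method would only call `cluster.predict(model, newdata)`, so the distance metric stays hidden inside the clustering pair.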

hetong007 commented 9 years ago

Hi Aydin,

Sorry I was unclear about the distance metric used by rcppmlpack. Since I saw that the euclidean distance is currently the only distance metric, I think this problem can be resolved now. I will push the fix for this soon. So far I have no further questions about clusterSVM.

Any suggestions on the DCSVM's two questions are appreciated!

And shame on me, I noticed there's the check box "Check the 'bias augmentation' trick" but I forgot what this trick is :( Can you briefly remind me which part of the paper I should look at? Thank you.

aydindemircioglu commented 9 years ago

about DCSVM:

for now just write your own predict function for kkmeans, even if it is slow. question is anyway, if kernel kmeans improves things, if i remember correctly, the author said in the "followup" paper (DCpred++) that normal kmeans is enough (or at least the trade-off is not worth it). you should have a look at DCpred++ too, just browse through it to get a feeling if there is anything you could use.

on the other question, the answer is harder: i think both e1071 and the DCSVM libsvm implementation work with libsvm code, right? so technically, if you use the DCSVM libsvm code, you would basically need to write an e1071-style wrapper for it. if you instead use e1071, then you would need to port the changes from the DCSVM libsvm code to e1071. potentially this would allow others to use your code, but then again i am not sure if e1071 wants the libsvm code to be modified, since this makes porting new versions of libsvm into e1071 more complicated. for now i would actually try to fork e1071 and hack the things into it, unless writing a wrapper for the DCSVM libsvm code is really much easier.

aydindemircioglu commented 9 years ago

bias augmentation: you pretend that adding a 'constant' dimension to your data is enough to mimic a bias, i.e. you replace your data by adding an extra '1' everywhere, e.g. a vector (1 2 3) would become a 4-dim vector (1 2 3 1). there is a page on this in the pegasos paper by shalev-shwartz, i think, and also a (rather lengthy) discussion of why this approach is actually not the same as adding a bias. for now, i think you should ignore the checkbox and come back to it if the other things are working out.
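A small sketch of the augmentation step in R; the helper name `augment.bias` and the example weights are made up for illustration. It also checks the identity that motivates the trick: with weights w and bias b, the value of w·x + b is the same as the inner product of (w, b) with (x, 1).

```r
# Append a constant feature so that the last weight plays the role of the bias.
# `augment.bias` is a hypothetical helper name.
augment.bias <- function(x, value = 1) {
  cbind(as.matrix(x), bias = value)
}

x <- matrix(c(1, 2, 3), nrow = 1)
x.aug <- augment.bias(x)          # row (1 2 3) becomes (1 2 3 1)

# Illustrative weights and bias, not taken from any trained model.
w <- c(0.5, -1, 2)
b <- 0.25
lhs <- sum(w * x) + b             # ordinary linear model with bias
rhs <- sum(c(w, b) * x.aug)       # bias folded into the weight vector
```

The two values agree for a fixed (w, b); the difference mentioned above comes from training, where the folded-in bias weight is regularized together with the other weights.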

hetong007 commented 9 years ago

Thank you for the suggestions. The license of e1071 is GPL-2 which says

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

If we port part of the code and merge it into our repository, does it violate the license? Anyway I am going to compare the differences between the two versions of libsvm.

I will put bias augmentation aside for now.

berndbischl commented 9 years ago

General hint, don't know if useful now: Look at FNN and the 2 get.knn functions in there. FNN is pretty fast.

aydindemircioglu commented 9 years ago

(edited a bit to make it more readable:) if you use GPL-2 code directly, then you must make your source code GPL as well. you are allowed to modify the code (but not the license). this would break your idea of having your code under the MIT license, so putting e1071 directly into your package and keeping MIT won't work. alternatively, if you fork e1071, do all your modifications there (under the GPL license) and then import this fork of yours (so both projects stay separate and swarmsvm only loads the fork), then you do not need to change your MIT license. this would probably be the better way. besides, libsvm and also DCSVM are also MIT, so if you want to use the DCSVM modifications directly in your package, that would be ok as well.

hetong007 commented 9 years ago

May I know how exactly to "import this fork of yours"? Do I put this forked repository as a submodule under the inst/ folder and make the modifications in that place?

aydindemircioglu commented 9 years ago

actually no, then your swarmsvm would contain e1071 and need to be GPL again. you would really need to put up your own package fork in a separate git repo (let's call it e1071tong) and need your users to install this e.g. via install_github. or release your e1071tong on cran and load that. if i remember correctly, you can put compiled binaries of the GPL code along with your package, but cannot use the source code directly. this makes things complicated. probably both things suck, and that's why some people (including me) do not really like GPL. what license is kernlab? if it's LGPL or MIT you could use that directly.

hetong007 commented 9 years ago

kernlab is GPL-2 as well. I might be a bit dumb here and I really need to know more about the license, but isn't this "fork - modification - release" process also against the GPL-2 rule?

aydindemircioglu commented 9 years ago

no. it's the very basis of GPL. but the fork needs to be GPL-2 (and available to the public) as well!

aydindemircioglu commented 9 years ago

you could also think of it as kind of disease-- as soon as you use any part of the source code in your code, ALL your code is affected and needs to be/is GPL. but you are free e.g. to let the user download your fork, and your code only interacts with the fork, thats ok.

hetong007 commented 9 years ago

Oh I see. Thanks for the explanation.

Since our goal is to port it into mlr, that means I need to submit the so-called e1071tong to CRAN as well otherwise people cannot get the dependency installed easily.

I am reading the code from DCSVM and trying to figure out how it differs from e1071::svm, and then estimating how much effort it requires regardless of the license issues.

aydindemircioglu commented 9 years ago

alternatively you could change your MIT license to GPL. then you could use all of the code directly. and as mlr will only load your package, and not use its source code directly, this would work easily. (only the modifications inside mlr need to be BSD, but these are yours, so you have full control over that)

hetong007 commented 9 years ago

That sounds feasible! I can switch mine to GPL.

hetong007 commented 9 years ago

I read through the code from DCSVM, and there are indeed some modifications I cannot reproduce in a day. So I just copied his modifications on the libsvm side and pushed them to my fork of e1071. I will compare his matlab implementation with the official one from libsvm, and introduce the differences into e1071. Hopefully this can port the algorithm into R.

hetong007 commented 9 years ago

Today I managed to port the hacked svm into SwarmSVM, based on e1071 and DCSVM. I next will fill up the skeleton of the main DCSVM. With the enhanced svm, the main algorithm is straightforward to implement.

hetong007 commented 9 years ago

I met a problem when testing dcsvm:

It is possible for the support vectors of one class to be close to each other. Therefore clustering on support vectors can sometimes produce clusters that contain only support vectors from a single class, so that the data points in such a cluster belong to only one class. It is not possible to train an SVM on this subproblem.

I am wondering if it is a (theoretically) good idea to do stratified kernel clustering, similar to stratified splitting in cross validation.

aydindemircioglu commented 9 years ago

please give many more specific details: which dataset, what are your hyperparameters? what does the original code do in this case? technically, i would put all (multiplied-with-label) alphas in this cluster to C or -C, so that all predictions are exactly of the one given class. shouldn't that fix the problem? you could use stratified clustering, there are many such approaches in ensemble svms literature, but i'd keep it simple and working first before exploring alternatives.

hetong007 commented 9 years ago

I am testing the dcSVM function on the svmguide1 data. The clustering process has randomness, and I am not yet able to reproduce the problem with a seed.

The problem I met is that after the clustering step, some clusters have only one class label, but the revised code (which takes initial alpha values) asks for at least two classes.

aydindemircioglu commented 9 years ago

it is clear to me what you do, but unclear how you chose the number of clusters. obviously you can easily reproduce the situation by choosing a very large k for the clustering, in the extreme case just choose k to be the same as n, the number of data points (in practical terms: choose a large depth). again, what does the reference dc-svm code do in this case? does it also stop working?
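To see how often this happens for a given k, one can tabulate cluster ids against class labels. The sketch below is an assumption-laden illustration: `single.class.clusters` is a hypothetical helper, and the cluster assignment is hand-made rather than produced by the actual clustering step.

```r
# Return the ids of clusters whose members all share one class label.
# `single.class.clusters` is a hypothetical helper name.
single.class.clusters <- function(cluster, y) {
  tab <- table(cluster, y)
  rownames(tab)[rowSums(tab > 0) == 1]
}

# Hand-made cluster assignment and labels, for illustration only.
cluster <- c(1, 1, 2, 2, 3, 3, 3)
y <- factor(c(-1, 1, 1, 1, -1, -1, 1))
pure <- single.class.clusters(cluster, y)   # cluster 2 holds only label 1
```

Running this on the cluster vector from the real clustering step for increasing k would show at which point one-class subproblems start to appear.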

hetong007 commented 9 years ago

Actually I cannot run his demo MATLAB code on my machine. There seem to be some bugs in the types of the arguments in his code.

aydindemircioglu commented 9 years ago

mmh, i have it actually running under octave as well as matlab.

hetong007 commented 9 years ago

I have MATLAB R2015a on my lab machine under CentOS 6 environment. I re-compiled the code in libsvm-3.14-nobias/matlab by libsvm-3.14-nobias/matlab/make.m, and then I met the following error when running dcsvm/demo_ijcnn.m:

Error using svmtrain (line 234)
Y must be a vector or a character array.

Error in dcsvm_core (line 84)
        models{i} =
                svmtrain(trainy(idx==i),trainX(idx==i,:),libsvmcmd);

Error in dcsvm_rbf_train (line 34)
model = dcsvm_core(trainy, trainX, C, kernel_parameters, ncluster, level,
level_stop, kk, tol, mode, method, kernel);

In the function call svmtrain(trainy(idx==i),trainX(idx==i,:),libsvmcmd);, the second argument is a matrix trainX(idx==i,:), however svmtrain is defined as

function [svm_struct, svIndex] = svmtrain(training, groupnames, varargin)

And the error occurs at

if ~isvector(groupnames) && ~ischar(groupnames)
    error(message('stats:svmtrain:GroupNotVector'));
end

May I know what I did wrong? Thank you very much!

aydindemircioglu commented 9 years ago

i do not know what you do wrong. did you check that the data is loaded correctly? for my tests, i did not care about the demos, i did go ahead and used the wrappers the way i need, this is the code i have running

% for octave: change directory
isOctave = exist('OCTAVE_VERSION') ~= 0;

if (isOctave)
  chdir ('software/DCSVM/src/dcsvm')
  pkg load statistics    
end  

% add libsvm path
addpath('.')
addpath('../dcsvm');
addpath('../libsvm-3.14-nobias/matlab');

% read training set 
[trainy trainX] = libsvmread('/tmp/Rtmp7pAwgk/file3716845d710');

%% train/test rbf kernel SVM
ncluster = 64; 
gamma = 32;
C = 32;
earlyStopping = 1;

if (earlyStopping == 1)
    model = dcsvm_rbf_train(trainy, trainX, C, gamma, ncluster, 0.001);
else
    model = dcsvm_rbf_train_exact(trainy, trainX, C, gamma, 0.001);
end

actually it is a template, some variables are filled in by R and then the script is executed.

hetong007 commented 9 years ago

The demo has the same structure:

addpath('../libsvm-3.14-nobias/matlab');
maxNumCompThreads(1);

[trainy trainX] = libsvmread('../data/ijcnn1.train');
[testy, testX] = libsvmread('../data/ijcnn1.t');
%% train/test rbf kernel SVM
ncluster = 10;
gamma = 2;
C = 32;
fprintf('Start training Gaussian kernel SVM with early prediction\n', ncluster);
timebegin = cputime;
model = dcsvm_rbf_train(trainy, trainX, C, gamma, ncluster);

Anyway, I observed different behavior between my svm and e1071::svm when doing one-class classification. Mine doesn't detect any support vectors. I am looking into this issue. If it gets resolved, then it can still find some support vectors on a cluster with only one label.

aydindemircioglu commented 9 years ago

then obviously i copied the demo and it runs fine. i do not think your approach with one-class-svm is the correct one. it will detect outliers. this is probably not what you want. i still think what i've proposed already will work better.

hetong007 commented 9 years ago

i would put all (multiplied-with-label) alphas in this cluster to C or -C, so that all predictions are exactly of the one given class.

Do you mean that I set all the alpha in this cluster to C*y, where C is the parameter controlling the slack variables?

If that is the case, then yi * ai always equals C*yi^2 and the sign is always positive.

aydindemircioglu commented 9 years ago

then just put all alpha_i of the one class to zero and the others to C. at least for rbf kernel this should work, or?

hetong007 commented 9 years ago

then just put all alpha_i of the one class to zero and the others to C. at least for rbf kernel this should work, or?

Sorry I am not following this one. Could you please specify what is "the others"?

My understanding of your suggestion is: for a cluster with only one label, we don't train an svm. Instead, we treat all the data points as support vectors and manually assign all the alpha_i = C. Then when doing classification we sum y_i * alpha_i * Kernel = C * y_i * Kernel, and the summation will have the same sign as y_i.
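The sign argument can be checked numerically. This is a rough sketch under assumptions: an RBF kernel, no bias term, and hypothetical helper names `rbf.kernel` and `single.class.decision`; it is not the DCSVM or e1071 code.

```r
# RBF kernel between the rows of a and the rows of b.
rbf.kernel <- function(a, b, gamma) {
  sq <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * a %*% t(b)
  exp(-gamma * pmax(sq, 0))   # clamp tiny negative values from rounding
}

# Decision value for a cluster whose points all carry label y (+1 or -1):
# every point is treated as a support vector with alpha_i = C, no bias term.
single.class.decision <- function(x.cluster, y, newdata, C, gamma) {
  K <- rbf.kernel(newdata, x.cluster, gamma)
  as.vector(K %*% rep(C * y, nrow(x.cluster)))
}

set.seed(42)
x.cluster <- matrix(rnorm(20), ncol = 2)   # 10 points, all with label -1
newdata <- matrix(rnorm(10), ncol = 2)     # 5 query points
dec <- single.class.decision(x.cluster, y = -1, newdata, C = 32, gamma = 2)
```

Because every RBF kernel value is positive, each term has the sign of y, so the sum does too. For kernels that can take negative values this argument no longer holds.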

hetong007 commented 9 years ago

Today I tried the following two methods for the neural network calculation for the gater SVM.

  1. We have to optimize an unusual loss function, so I think we can change the definition of "error" in the training. The mathematical detail is in this link: http://mathb.in/38329
  2. The unusual loss function can be seen as an additional layer with the same activation function tanh. So in the last layer the outputs from the SVM experts are fixed weights that don't change as the training process proceeds. We just propagate the error to the previous layers and update the weights.

All the attempts are based on the code of the package neuralnet. It is written in pure R and thus easier to modify, and it supports multiple hidden layers as well.

For the output from the experts, the paper doesn't specify whether it is the {-1,1} classification label, or the decision value. I choose the classification label here.

However, in the end, on the data set svmguide1, the results from both methods are awful. The output from the neural network tends to be 1 or -1, or extremely close values like -0.99999, 0.9999 etc. It seems they are not fitting the loss function at all.

In order to find out what I did wrong, would you mind briefly checking my methods, especially the mathematical details? The first strategy is in the end not fitting the y_i, therefore I am afraid I am using a wrong derivation.

I will also check the math derivation, as well as the modified neuralnet code, to make sure I am calculating the correct thing.

hetong007 commented 9 years ago

Update: I figured out the reason for the bad output: some bugs in the svm training process, rather than in the neural network. According to the paper, the SVM experts are trained on a subset of the data and predict on the entire data set. But this afternoon some experts predicted a single label for all the data. I must be stupid on some details but I just cannot fix it :(

Now as a temporary solution, I switched the SVM experts to logistic experts, whose output is the probability. Now the neural network works! The neural network is trained with the first strategy (hack the error term) described previously. On the svmguide1 data set with naive parameters it reaches an accuracy of 80%~81%, which is a normal score. See the examples in the documentation of gaterSVM.

My current (and limited) observation of the output from this gater neural network is:

This is much simpler than I expected. In this sense, the iteration of re-assigning data points to experts is somewhat meaningless: there's almost no difference between weights, and the "sort experts according to weights" step does not matter much.

hetong007 commented 9 years ago

I found that the matlab code of dcSVM can be run on my Windows machine. And it turns out to work well with one-class subproblems. My port must be going wrong at some point. I am comparing the copy-and-paste code in my fork of e1071, but have no clue yet.

aydindemircioglu commented 9 years ago

tong, one source of the problem might be that your cluster-predict method is not working the way you think. i think you are re-clustering instead of assigning points to the previously found clusters.

hetong007 commented 9 years ago

I think in the original paper the author talked about the 'early prediction', and this line of code refers to this method. I used the generated center so that it is assigning the label according to the existing centers.

Currently I have not tested this feature yet. I am just testing the final single SVM model. And I found out that the final model (trained based on the previous result of alpha) works much worse than simply training the model without any prior alpha. It might be because I manually set the alpha for subproblems with only one class. I will find out how the matlab code deals with this problem.

hetong007 commented 9 years ago

Hello Aydin and Bernd, do you know how to reproduce the environment in R CMD check --as-cran?

In the process of checking the example code in the documentation, both local and travis tests get errors. But I cannot reproduce the errors in a normal R session. Besides, the local and travis tests give me different errors. The one on travis is about my modified svm (which is called alphasvm now), and the local one is a segmentation fault when calling LiblineaR. They can be reproduced by R CMD check --as-cran every time.

If I can reproduce them, then I think there's a big chance to fix them. Thank you!

berndbischl commented 9 years ago

Well what happens if you simply run the code without R CMD check? Does that throw an error as well?

aydindemircioglu commented 9 years ago

tong, i am completely puzzled again about which tests you ran locally vs travis/cran check. what i saw in a few minutes:

a) cran check crashes with clusteredSVM. this seems to stem from the iris example in the header. this makes debugging hard, i suppose not only for me. for tests there is testthat.
b) when putting the iris example into another file and executing it, i get no error.
c) when calling the same script with valgrind check "R -d valgrind", i see an overflow error in some erf function. this keeps me from continuing debugging.
d) it seems like a simple library(RcppMLPACK) produces that error. so something about that package looks fishy. at least it keeps me from valgrind debugging. it might be that RcppMLPACK corrupts some memory and that subsequent calls are therefore broken. try to remove the RcppMLPACK dependency, go back to stats::kmeans, and see if it is still a problem.

hetong007 commented 9 years ago

@berndbischl If I run the code in R, there's no error. @aydindemircioglu Seems now we have 3 different errors in total on different platforms.

* checking examples ... ERROR
Running examples in ‘SwarmSVM-Ex.R’ failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: clusterSVM
> ### Title: Clustered Support Vector Machine
> ### Aliases: clusterSVM
>
> ### ** Examples
>
> data(iris)
> x=iris[,1:4]
> y=factor(iris[,5])
> train=sample(1:dim(iris)[1],100)
>
> xTrain=x[train,]
> xTest=x[-train,]
> yTrain=y[train]
> yTest=y[-train]
>
> csvm.obj = clusterSVM(x = xTrain, y = yTrain, sparse = FALSE,
+     centers = 2, iter.max = 1000,
+     valid.x = xTest,valid.y = yTest)
Time for Clustering: 0.00699999999999967 secs

Time for Transforming: 0.00199999999999978 secs

 *** caught segfault ***
address 0x21000013, cause 'memory not mapped'

Traceback:
 1: .C("trainLinear", as.double(W_ret), as.integer(labels_ret), as.double(if (sparse) data@ra else t(data)),     as.double(yC), as.integer(n), as.integer(p), as.integer(sparse),     as.integer(if (sparse) data@ia else 0), as.integer(if (sparse) data@ja else 0),     as.double(b), as.integer(type), as.double(cost), as.double(epsilon),     as.double(svr_eps), as.integer(nrWi), as.double(Wi), as.integer(WiLabels),     as.integer(cross), as.integer(verbose), PACKAGE = "LiblineaR")
 2: LiblineaR(data = tilde.x, target = y, type = type, cost = cost,     epsilon = epsilon, svr_eps = svr.eps, bias = bias, wi = wi,     cross = 0, verbose = (verbose >= 2))
 3: clusterSVM(x = xTrain, y = yTrain, sparse = FALSE, centers = 2,     iter.max = 1000, valid.x = xTest, valid.y = yTest)
aborting ...

aydindemircioglu commented 9 years ago

- R version 3.1.2 (2014-10-31)
- ubuntu 15.04 64bit
- [1] Rcpp_0.11.6 RcppMLPACK_1.0.10-2

i can reproduce that on a ubuntu 15.04 32bit in a virtual box, same version of libraries.

what does valgrind say about your alphasvm problem?