Closed by aydindemircioglu 9 years ago
Hello Aydin and Bernd,
I am back to work on GSoC now. Things I did so far:
I added a simple test file and a script in inst/benchmark to reproduce the results. In the benchmark script I also tuned the parameters. I didn't email the author because I observed a 99% accuracy rate on the ijcnn1 data set, while the corresponding number in the paper is 94%. The tuning process takes time; I hope to have all the results by tomorrow.
Besides, there are several issues on which I need suggestions:
Regarding the clustering function:
There's no argument to mute the output message from RcppMLPACK::mlKmeans.
RcppMLPACK but privately the author told me that he's been occupied by his internship. Maybe it is faster for me to try to hack and open a pull request.
Regarding DCSVM:
kernlab::kkmeans cannot predict the result given the kernel and a list of centers. kernlab offers a kernelf function to extract the kernel function, but the function for the Gaussian kernel is not vectorized.
kernlab::ksvm and e1071::svm don't take support vectors to initialize the model. However, after I read the author's MATLAB code, I found that there exists a C implementation which takes them. e1071::svm?
There's no argument to mute the output message from RcppMLPACK::mlKmeans.
Have you tried suppressMessages / capture.output for a simple external solution? Or BBmisc::suppressAll?
BBmisc::suppressAll works, thank you!
RcppMLPACK::mlKmeans: it cannot predict the result given a list of centers.
Is there no returned model and a way to apply it to new data?
I don't know how it is for MLPACK, but for RcppMLPACK::mlKmeans there's only the number of clusters and the clustering labels. It seems it is not a well-packaged function, and its only advantage is speed.
But given the centers, isn't it easy enough to assign the points to the nearest center yourself?
Not hard if I know the distance metric. Ideally the clustering function can also predict for new data, so that I can save this function in the training output and then call it in predict.
I don't get the second sentence.
I can have result$cluster.fun in the training result, then in predict.clusterSVM I can call this function to predict the label for new data points.
I just looked into the code. The default metric seems to be squared euclidean distance. There are other options in MLPACK, but they are not exposed in the R function. So I think I can just use the euclidean distance.
i still do not quite get it. what i expect: each clustering comes with a training and a predict function. the training function will produce some object (e.g. containing the centroids) that can be passed to its predict function, which will then assign the nearest cluster for a new dataset.
cluster.SVM should now work similarly: the train function needs a trainfun = cluster.train, and needs to save the object the cluster training will produce. then the predict function will take the saved object and call the cluster.predict function with this object.
i hope this explanation is not too messed up. but all the euclidean distances etc are wrapped in some predict function and clusterSVM.predict only calls this one.
if MLpack works with euclidean distances (i.e. you call it that way), then you can use euclidean distances in the cluster predict function as well.
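The train/predict pairing described above can be sketched roughly as follows. This is only an illustration in Python, not the package's code: every name in it (cluster_train, cluster_predict, cluster_svm_train, cluster_svm_assign) is hypothetical, plain Lloyd k-means with squared euclidean distance stands in for the real clustering backend, and the per-cluster SVM training is left out.

```python
# Hypothetical sketch: a clustering "train" function returns a model object,
# and the matching "predict" function assigns new points to the stored centers.

def sq_euclidean(a, b):
    """Squared euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def cluster_predict(model, points):
    """Assign each point to its nearest stored centroid (no re-clustering)."""
    centers = model["centers"]
    return [min(range(len(centers)), key=lambda j: sq_euclidean(p, centers[j]))
            for p in points]

def cluster_train(points, k, iters=20):
    """Plain Lloyd k-means with a naive init (the first k points)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p, j in zip(points, cluster_predict({"centers": centers}, points)):
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return {"centers": centers}

def cluster_svm_train(x, y, k, train_fun=cluster_train, predict_fun=cluster_predict):
    """Save the cluster model and its predict function in the fitted object."""
    cluster_model = train_fun(x, k)
    # Per-cluster SVM training on (x, y) is omitted in this sketch.
    return {"cluster_model": cluster_model, "cluster_predict": predict_fun}

def cluster_svm_assign(fit, new_x):
    """The predict step only calls the saved cluster predict function."""
    return fit["cluster_predict"](fit["cluster_model"], new_x)

pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
fit = cluster_svm_train(pts, [1, 1, -1, -1], k=2)
print(cluster_svm_assign(fit, [[1.0, 1.0], [9.0, 9.0]]))
```

The point of the design is that the distance metric lives only inside the cluster predict function, so the SVM-level predict never needs to know which metric the clustering used.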
Hi Aydin,
Sorry I was unclear about the distance metric used by RcppMLPACK. Since I saw that euclidean distance is now the only distance metric, I think this problem can be resolved now. I will push the fix for this soon. So far I have no further questions for clusterSVM.
Any suggestions on the DCSVM's two questions are appreciated!
And shame on me, I noticed there's the check box "Check the 'bias augmentation' trick", but I forgot what this trick is :( Can you briefly remind me which part of the paper I should look at? Thank you.
about DCSVM:
for now just write your own predict function for kkmeans, even if it is slow. question is anyway, if kernel kmeans improves things, if i remember correctly, the author said in the "followup" paper (DCpred++) that normal kmeans is enough (or at least the trade-off is not worth it). you should have a look at DCpred++ too, just browse through it to get a feeling if there is anything you could use.
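A hand-written predict step for kernel k-means could look roughly like the sketch below. It is an assumption-laden illustration, not kernlab's API: the function names are hypothetical, an RBF kernel is assumed, and it needs the training points together with their cluster labels, because the centers exist only implicitly in feature space. It uses the identity ||phi(x) - m_c||^2 = K(x,x) - 2 * mean_i K(x,x_i) + mean_ij K(x_i,x_j), where i and j run over the members of cluster c.

```python
import math

def rbf(a, b, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def kkmeans_predict(new_points, train_points, labels, k, gamma):
    """Assign new points to existing kernel k-means clusters (hypothetical helper)."""
    members = [[p for p, l in zip(train_points, labels) if l == c] for c in range(k)]
    # The within-cluster term depends only on the cluster, so compute it once.
    within = [sum(rbf(a, b, gamma) for a in m for b in m) / len(m) ** 2 if m else None
              for m in members]
    assignments = []
    for x in new_points:
        dists = []
        for c in range(k):
            if not members[c]:
                dists.append(float("inf"))  # an empty cluster can never win
                continue
            cross = sum(rbf(x, p, gamma) for p in members[c]) / len(members[c])
            dists.append(rbf(x, x, gamma) - 2.0 * cross + within[c])
        assignments.append(min(range(k), key=dists.__getitem__))
    return assignments

train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
print(kkmeans_predict([[0.0, 0.5], [5.0, 5.5]], train, [0, 0, 1, 1], k=2, gamma=0.5))
```

This is slow (one kernel evaluation per new point and training point), which matches the "even if it is slow" caveat above.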
on the other question, the answer is harder: i think e1071 as well as the DCSVM libsvm implementation both work with libsvm code, right? so technically, if you use the DCSVM libsvm code, you would basically need to write an e1071-style wrapper for it? if you instead use e1071 then you would need to port the changes from the DCSVM libsvm code to e1071. potentially this would allow others to use your code, but then again i am not sure if e1071 wants the libsvm code to be modified, as this makes porting new versions of libsvm into e1071 more complicated. for now i would actually try to fork e1071 and hack the things into it, except if writing a wrapper for the DCSVM libsvm code is really much easier.
bias augmentation: you pretend that adding a 'constant' dimension to your data is enough to mimic bias, i.e. you replace your data by adding an extra '1' everywhere, e.g. a vector (1 2 3) would become a 4-dim vector (1 2 3 1). there is a page on this in the pegasos paper by shalev-shwartz, i think, and also a (rather lengthy) discussion of why this approach is actually not the same as adding a bias. for now, i think you should ignore the checkbox and come back to it if the other things are working out.
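As a small illustration of the trick just described, here is a Python sketch (the helper names augment and dot are made up for this example). It shows only the prediction side: a weight vector with an extra last component reproduces w·x + b on the augmented data.

```python
def augment(x, const=1.0):
    """Append a constant feature, e.g. (1 2 3) -> (1 2 3 1)."""
    return list(x) + [const]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

x = [1.0, 2.0, 3.0]
w, b = [0.5, -1.0, 2.0], 0.25

x_aug = augment(x)   # [1.0, 2.0, 3.0, 1.0]
w_aug = w + [b]      # the bias becomes the last weight

# Same decision value with and without the explicit bias term.
print(dot(w, x) + b, dot(w_aug, x_aug))
```

During training the two are not equivalent, since the regularizer on the augmented weight vector also penalizes the last component, i.e. the bias.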
Thank you for the suggestions. The license of e1071 is GPL-2, which says
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
If we port part of the code and merge it into our repository, does it violate the license? Anyway I am going to compare the differences between the two versions of libsvm.
I will put bias augmentation aside for now.
General hint, dont know if useful now: Look at FNN and the 2 get.knn functions in there. FNN is pretty fast.
(edited a bit to make it more readable:) if you use GPL-2 code directly, then you must make your source-code GPL as well. you are allowed to modify the code (but not the license). this would break your idea of having your code under MIT licence, so putting e1071 directly into your package and keeping MIT wont work. alternatively, if you fork e1071, do all your modifications there (under GPL license) and then import this fork of yours (so both projects stay separate and swarmsvm only loads the fork), then you do not need to change your MIT license. this would be probably the better way. besides, libsvm and also DCSVM are also MIT, so if you want to use the DCSVM modifications directly in your package, that would be ok as well.
May I know how exactly to "import this fork of yours"? Do I put this forked repository as a submodule under the inst/ folder and make the modifications in that place?
actually no, then your swarmsvm would contain e1071 and need to have GPL again. you would really need to put up an own package fork on a separate git repo (lets call it e1071tong) and need your users to install this e.g. via install_github. or release your e1071tong on cran and load that. if i remember correctly, you can put compiled binaries of the GPL code along your package, but cannot use the source code directly. this makes things complicated. probably both things suck, and thats why some people (including me) do not really like GPL. what license is kernlab? if its LGPL or MIT you could use that directly.
kernlab is GPL-2 as well. I might be a bit dumb here and I really need to know more about the license, but isn't this "fork - modification - release" process also against the GPL-2 rule?
no. its the very basic of GPL. but the fork needs to be GPL-2 (and available to the public) as well!
you could also think of it as kind of disease-- as soon as you use any part of the source code in your code, ALL your code is affected and needs to be/is GPL. but you are free e.g. to let the user download your fork, and your code only interacts with the fork, thats ok.
Oh I see. Thanks for the explanation.
Since our goal is to port it into mlr, that means I need to submit the so-called e1071tong to CRAN as well; otherwise people cannot get the dependency installed easily.
I am reading the code from DCSVM and trying to figure out the differences from e1071::svm, so that I can estimate how much effort it requires regardless of the license issues.
alternatively you could change your MIT license to GPL. then you could use all of the code directly. and as mlr will only load your package, and not use its source code directly, this would work easily. (only the modifications inside mlr need to be BSD, but these are yours, so you have full control over that)
That sounds feasible! I can switch mine to GPL.
I read through the code from DCSVM, and there are indeed some modifications I cannot reproduce in a day. So I just copied his modifications on the libsvm side and pushed them to my fork of e1071. I will compare his MATLAB implementation with the official one from libsvm, and introduce the differences into e1071. Hopefully this will port the algorithm into R.
Today I managed to port the hacked svm into SwarmSVM, based on e1071 and DCSVM. Next I will fill in the skeleton of the main DCSVM. With the enhanced svm, the main algorithm is straightforward to implement.
I met a problem when testing dcsvm:
It is possible for the support vectors of one class to be close to each other. Therefore clustering on support vectors can sometimes produce clusters that contain support vectors from a single class only, so that all data points in such a cluster belong to one class. It is not possible to train an SVM on this subproblem.
I am wondering if it is (theoretically) a good idea to do stratified kernel clustering, similar to stratified splitting in cross validation.
please give many more specific details: which dataset, what are your hyperparameters? what does the original code do in this case? technically, i would put all (multiplied-with-label) alphas in this cluster to C or -C, so that all predictions are exactly of the one given class. shouldn't that fix the problem? you could use stratified clustering, there are many such approaches in ensemble svms literature, but i'd keep it simple and working first before exploring alternatives.
I am testing the dcSVM function on the svmguide1 data. The clustering process is random, and I am not yet able to reproduce the problem with a seed.
The problem I met is that after the clustering step, some clusters have only one class label, but the revised code (which takes initial alpha values) asks for at least two classes.
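To make the failing case easy to spot before an SVM is trained on a subproblem, a check along these lines could be used. This is a hypothetical Python helper, not part of the package or of the reference code; it simply reports which clusters contain a single class label.

```python
from collections import defaultdict

def single_class_clusters(cluster_ids, labels):
    """Return {cluster_id: label} for every cluster whose points share one label."""
    seen = defaultdict(set)
    for c, y in zip(cluster_ids, labels):
        seen[c].add(y)
    return {c: next(iter(ys)) for c, ys in seen.items() if len(ys) == 1}

cluster_ids = [0, 0, 1, 1, 2]
labels = [1, -1, 1, 1, -1]
print(single_class_clusters(cluster_ids, labels))
```

Clusters reported here would need special handling instead of a regular two-class SVM fit.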
it is clear to me what you do, but unclear how you chose the number of clusters. obviously you can easily reproduce the situation by choosing a very large k for the clustering, in the extreme case just choose k to be the same as n, the number of data points (in practical terms: choose a large depth). again, what does the reference dc-svm code do in this case? does it also stop working?
Actually I cannot run his demo MATLAB code on my machine. There seem to be some bugs in the argument types in his code.
mmh, i have it actually running under octave as well as matlab.
I have MATLAB R2015a on my lab machine under a CentOS 6 environment. I re-compiled the code in libsvm-3.14-nobias/matlab with libsvm-3.14-nobias/matlab/make.m, and then I got the following error when running dcsvm/demo_ijcnn.m:
Error using svmtrain (line 234)
Y must be a vector or a character array.
Error in dcsvm_core (line 84)
models{i} = svmtrain(trainy(idx==i),trainX(idx==i,:),libsvmcmd);
Error in dcsvm_rbf_train (line 34)
model = dcsvm_core(trainy, trainX, C, kernel_parameters, ncluster, level, level_stop, kk, tol, mode, method, kernel);
In the function call svmtrain(trainy(idx==i),trainX(idx==i,:),libsvmcmd); the second argument is a matrix trainX(idx==i,:), however svmtrain is defined as
function [svm_struct, svIndex] = svmtrain(training, groupnames, varargin)
And the error occurs at
if ~isvector(groupnames) && ~ischar(groupnames)
error(message('stats:svmtrain:GroupNotVector'));
end
May I know what I did wrong? Thank you very much!
i do not know what you do wrong. did you check that the data is loaded correctly? for my tests, i did not care about the demos, i did go ahead and used the wrappers the way i need, this is the code i have running
% for octave: change directory
isOctave = exist('OCTAVE_VERSION') ~= 0;
if (isOctave)
    chdir ('software/DCSVM/src/dcsvm')
    pkg load statistics
end
% add libsvm path
addpath('.')
addpath('../dcsvm');
addpath('../libsvm-3.14-nobias/matlab');
% read training set
[trainy trainX] = libsvmread('/tmp/Rtmp7pAwgk/file3716845d710');
%% train/test rbf kernel SVM
ncluster = 64;
gamma = 32;
C = 32;
earlyStopping = 1;
if (earlyStopping == 1)
    model = dcsvm_rbf_train(trainy, trainX, C, gamma, ncluster, 0.001);
else
    model = dcsvm_rbf_train_exact(trainy, trainX, C, gamma, 0.001);
end
actually it is a template, some variables are filled in by R and then the script is executed.
The demo has the same structure:
addpath('../libsvm-3.14-nobias/matlab');
maxNumCompThreads(1);
[trainy trainX] = libsvmread('../data/ijcnn1.train');
[testy, testX] = libsvmread('../data/ijcnn1.t');
%% train/test rbf kernel SVM
ncluster = 10;
gamma = 2;
C = 32;
fprintf('Start training Gaussian kernel SVM with early prediction\n', ncluster);
timebegin = cputime;
model = dcsvm_rbf_train(trainy, trainX, C, gamma, ncluster);
Anyway, I observed different behavior between my svm and e1071::svm when doing one-classification. Mine doesn't detect any support vectors. I am looking into this issue. If it gets resolved, then it can still find some support vectors on a cluster with only one label.
then obviously i copied the demo and it runs fine. i do not think your approach with one-class-svm is the correct one. it will detect outliers. this is probably not what you want. i still think what i've proposed already will work better.
i would put all (multiplied-with-label) alphas in this cluster to C or -C, so that all predictions are exactly of the one given class.
Do you mean that I set all the alphas in this cluster to C*y, where C is the parameter controlling the slack variables?
If that is the case, then yi * ai always equals C*yi^2 and the sign is always positive.
then just put all alpha_i of the one class to zero and the others to C. at least for rbf kernel this should work, or?
Sorry I am not following this one. Could you please specify what is "the others"?
My understanding of your suggestion is: for a cluster with only one label, we don't train an SVM. Instead, we treat all the data points as support vectors and manually assign alpha_i = C to each of them; then when doing classification we sum y_i * alpha_i * Kernel = C * y_i * Kernel, and the summation will have the same sign as y_i.
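The constant-alpha reading above can be checked numerically with a short Python sketch. It assumes an RBF kernel and no bias term, and the function names are hypothetical. Because the RBF kernel is positive, every term C * y * K(x, x_i) has the sign of the cluster's single label y.

```python
import math

def rbf(a, b, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def constant_alpha_decision(x, cluster_points, y_label, C, gamma):
    """Decision value when every point in a one-label cluster gets alpha_i = C."""
    return sum(C * y_label * rbf(x, p, gamma) for p in cluster_points)

cluster = [[0.0, 0.0], [1.0, 0.0]]
# Even for a query far from the cluster the sign follows the label.
print(constant_alpha_decision([3.0, 3.0], cluster, y_label=-1, C=32.0, gamma=2.0))
```

One caveat of this sketch: for queries extremely far from the cluster the kernel values can underflow to 0.0 in floating point, in which case the sum is exactly zero and carries no sign.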
Today I tried the following two methods for the neural network calculation for the gater SVM.
tanh. So in the last layer the outputs from the SVM experts are fixed weights that don't change as the training process proceeds. We just propagate the error to the previous layers and update the weights.
All the attempts are based on the code of the package neuralnet. It is written in pure R, thus easier to modify, and it supports multiple hidden layers as well.
For the output from the experts, the paper doesn't specify whether it is the {-1,1} classification label, or the decision value. I choose the classification label here.
However in the end, on the data set svmguide1, the results from both methods are awful. The outputs from the neural network tend to be 1 or -1, or extremely similar values like -0.99999, 0.9999 etc. It seems they are not fitting the loss function at all.
In order to find out what I did wrong, would you mind briefly checking my methods, especially the mathematical details? The first strategy is in the end not fitting the y_i, therefore I am afraid my derivation is wrong.
I will also check the derivation, as well as the modified neuralnet code, to make sure I am calculating the correct thing.
Update: I figured out the reason for the bad output: some bugs in the svm training process, not the neural network. According to the paper, the SVM experts are trained on a subset of the data and predict on the entire data set. But this afternoon some experts predicted a single label for all the data. I must be missing some details but I just cannot fix it :(
Now as a temporary solution, I switched the SVM experts to logistic experts, and the output is the probability. Now the neural network works! The neural network is trained with the first strategy (hack the error term) described previously. On the svmguide1 data set with naive parameters it reaches an accuracy of 80%~81%, which is a normal score. See the examples in the documentation of gaterSVM.
My current (and limited) observation of the output from this gater neural network is:
This is much simpler than I expected. In this sense, the iteration of re-assigning data points to experts is somewhat meaningless: there's almost no difference between weights, and the "sort experts according to weights" step does not matter much.
I found that the matlab code of dcSVM can be run on my Windows machine, and it turns out to work well with one-class subproblems. My port must be going wrong at some point. I am comparing it with the copy-and-paste code in my fork of e1071, but have no clue yet.
tong, one source of the problem might be that your cluster-predict method is not working the way you think. i think you are re-clustering instead of assigning points to the previously found clusters.
I think in the original paper the author talked about the 'early prediction', and this line of code refers to this method. I used the generated center so that it is assigning the label according to the existing centers.
Currently I have not tested this feature yet. I am just testing the final single SVM model. And I found out that the final model (trained based on the previous alpha values) works much worse than simply training the model without any prior alpha. It might be because I manually set the alpha for subproblems with only one class. I will find out how the matlab code deals with this problem.
Hello Aydin and Bernd, do you know how to reproduce the environment in R CMD check --as-cran?
While checking the example code in the documentation, both the local and the travis tests produce errors, but I cannot reproduce the errors outside of the check. Besides, the local and travis tests give me different errors. The one on travis is about my modified svm (which is called alphasvm now), and the local one is a segmentation fault when calling LiblineaR. They can be reproduced by R CMD check --as-cran every time.
If I can reproduce them, then I think there's a big chance to fix them. Thank you!
Well what happens if you simply run the code without R CMD check? Does that throw an error as well?
tong, i am completely puzzled again, which tests you did run locally vs travis/cran check. what i saw in the some minutes: a) cran check crashes with clusteredSVM. this seems to stem from the iris example in the header. this makes debugging hard, i suppose not only for me. for tests there is testthat. b) when putting the iris example into another file and executing this, i get no error when it is executed. c) when calling the same script with valgrind check "R -d valgrind", then i see an overflow error in some erf function. this keeps me from continuing debugging. d) it seems like a simple library(RcppMLPACK) produces that error. so something about that package looks cheesy. at least it keeps me from valgrind debugging. it might be that RcppMLPACK destroys some memory and that subsequent calls are therefore broken. try to remove the RcppMLPACK dependency go back to stats::kmeans and see, if it is still a problem.
@berndbischl If I run the code in R, there's no error.
@aydindemircioglu It seems we now have 3 different errors in total on different platforms.
R CMD check --as-cran on a CentOS 6 machine with R 3.1.2. And the error message is actually related to LiblineaR (I have updated it to the latest version 1.94).
* checking examples ... ERROR
Running examples in ‘SwarmSVM-Ex.R’ failed
The error most likely occurred in:
> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: clusterSVM
> ### Title: Clustered Support Vector Machine
> ### Aliases: clusterSVM
>
> ### ** Examples
>
> data(iris)
> x=iris[,1:4]
> y=factor(iris[,5])
> train=sample(1:dim(iris)[1],100)
>
> xTrain=x[train,]
> xTest=x[-train,]
> yTrain=y[train]
> yTest=y[-train]
>
> csvm.obj = clusterSVM(x = xTrain, y = yTrain, sparse = FALSE,
+ centers = 2, iter.max = 1000,
+ valid.x = xTest,valid.y = yTest)
Time for Clustering: 0.00699999999999967 secs
Time for Transforming: 0.00199999999999978 secs
*** caught segfault ***
address 0x21000013, cause 'memory not mapped'
Traceback:
1: .C("trainLinear", as.double(W_ret), as.integer(labels_ret), as.double(if (sparse) data@ra else t(data)), as.double(yC), as.integer(n), as.integer(p), as.integer(sparse), as.integer(if (sparse) data@ia else 0), as.integer(if (sparse) data@ja else 0), as.double(b), as.integer(type), as.double(cost), as.double(epsilon), as.double(svr_eps), as.integer(nrWi), as.double(Wi), as.integer(WiLabels), as.integer(cross), as.integer(verbose), PACKAGE = "LiblineaR")
2: LiblineaR(data = tilde.x, target = y, type = type, cost = cost, epsilon = epsilon, svr_eps = svr.eps, bias = bias, wi = wi, cross = 0, verbose = (verbose >= 2))
3: clusterSVM(x = xTrain, y = yTrain, sparse = FALSE, centers = 2, iter.max = 1000, valid.x = xTest, valid.y = yTest)
aborting ...
alphasvm. I occasionally hit this error when I am running R, but I cannot reliably reproduce it. Therefore I am eager to find a way to get the environment of R CMD check --as-cran, so that I can reproduce it and debug the function.
RcppMLPACK by valgrind, on either CentOS 6 or Ubuntu 14.04. Can you tell me what your environment for the test is? Thank you.
- R version 3.1.2 (2014-10-31)
- ubuntu 15.04 64bit
- [1] Rcpp_0.11.6 RcppMLPACK_1.0.10-2
i can reproduce that on a ubuntu 15.04 32bit in a virtual box, same version of libraries.
what does valgrind say about your alphasvm problem?
interesting, at least for dense vectors it seems to be much slower, roughly 50x for vectors of size a million.