Closed — Queuecumber closed this 9 years ago
Could you add some code to show how you initialize the SVM, i.e. your `var svm = new XXX(options)` call?
Debug ideas:
- Split your dataset into n subfiles (let's say n = 10) and train node-svm against each one; one or more may fail.
- In `node_modules/node-svm/lib/svm.js`, before line 145, export the problem to JSON using `fs.writeFileSync('temp.json', JSON.stringify(problem));`
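The dataset-splitting idea above can be sketched as follows. This is a hypothetical helper, not part of node-svm; note that a segfault kills the whole Node process, so a `try/catch` won't catch it — instead, log the chunk index before each run, and the last index printed before the crash identifies the offending chunk.

```javascript
// Split a dataset (array of [features, label] examples) into n roughly
// equal chunks.
function splitDataset(dataset, n) {
  var chunks = [];
  var size = Math.ceil(dataset.length / n);
  for (var i = 0; i < dataset.length; i += size) {
    chunks.push(dataset.slice(i, i + size));
  }
  return chunks;
}

// Train each chunk in turn; `trainChunk` is a placeholder for your
// actual training call. If the process segfaults, the last logged
// index tells you which chunk to inspect.
function trainEachChunk(dataset, n, trainChunk) {
  var chunks = splitDataset(dataset, n);
  chunks.forEach(function (chunk, i) {
    console.log('training chunk ' + i + ' (' + chunk.length + ' examples)');
    trainChunk(chunk);
  });
}
```

From there you can re-split the failing chunk and narrow down to the individual examples that trigger the crash.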
Note: for image processing you may want to use deep learning techniques instead of SVM. The main libraries are Theano and Caffe.
svm initialization:

```javascript
var opts = {
  type: nodesvm.SvmTypes.C_SVC,
  kernel: nodesvm.KernelTypes.LINEAR,
  C: 1.0,
  probability: true
};
var svmClassifier = new nodesvm.SVM(opts);
```
The dataset is cached in a redis server in JSON format rather than being stored in a text file. I will dump my labeled features, analyse them, train in groups of 10, and let you know the result when this is done.
Thanks for the suggestion, overfeat is in fact a convolutional network, I am stripping off the result at layer 19 to use as a feature for SVM training.
To keep things simple, you should also disable the following options during debugging:
- reduce
- normalize
- probability

```javascript
var opts = {
  type: nodesvm.SvmTypes.C_SVC,
  kernel: nodesvm.KernelTypes.LINEAR,
  C: 1.0,
  probability: false,
  normalize: false,
  reduce: false
};
```
Ok, I'll try that. I'm currently running another round of feature extraction using a smaller encoding, because I found that parts of the system were running out of memory. This will complete in around 30 hours, so I'll know more then.
Do you think this could be caused by an out-of-memory error, or would it report that instead of a segmentation fault? I don't suppose there is a way to operate on the data in a zero-copy manner?
I don't know if the seg fault is caused by running out of memory, but it is possible.
Can you monitor memory on your server to confirm?
I suggest checking your dataset anyway.
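One simple way to monitor memory from inside the Node process is to log `process.memoryUsage()` (a standard Node.js API) on a timer while training runs. This is just a sketch; the interval is arbitrary:

```javascript
// Periodically log the process memory usage. rss is the resident set
// size in bytes; heapUsed is the V8 heap actually in use.
function startMemoryLogger(intervalMs) {
  return setInterval(function () {
    var mem = process.memoryUsage();
    console.log('rss: ' + (mem.rss / 1048576).toFixed(1) + ' MB, ' +
                'heapUsed: ' + (mem.heapUsed / 1048576).toFixed(1) + ' MB');
  }, intervalMs);
}

// Usage: start before training, clear afterwards.
// var timer = startMemoryLogger(5000);
// svm.train(dataset);
// clearInterval(timer);
```

If rss climbs toward the machine's physical memory right before the crash, an out-of-memory condition inside the native code becomes a likely suspect.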
Just wanted to give you an update: I'm currently doing a lot of refactoring to ensure the integrity of my dataset before training. When I get that all done, I'll let you know if I still see this problem.
Ok, so I did a lot of refactoring and my memory usage is right about where I want it. During this process I was testing on some smaller datasets and found that the library crashes on a 5,000-sample dataset. I was able to print this dataset out and train on it successfully using the libsvm binaries. I am still playing with the sizes to find out exactly where the problems start, but other than that, how do you want to proceed?
More weirdness.
I added a module to my program to write out temp files and call the libsvm binaries so that I can start getting results out of this. When I'm testing the pipeline I use a couple of smaller sets because they finish quickly: one is a 10-sample set, the other a 100-sample set. I found that when I run the 10-sample set using the libsvm binaries I get 0% testing accuracy, and when I run it with your library I get 40% accuracy. When I run the 100-sample set, they both get 75% accuracy. Any idea what might cause this?
By the way I don't think I ever posted this but the code is available here if you want to review it to see exactly what is happening.
Accuracy can vary significantly on small datasets. For consistency you should use a dataset with more than 50 examples.
About the segfault error, can you:
- In `node_modules/node-svm/lib/svm.js`, before line 145, export the problem to JSON using `fs.writeFileSync('temp.json', JSON.stringify(problem));`
- Send me the resulting `temp.json`?
The accuracy doesn't seem to vary, though: when I use your library on the 10-sample set I always get 40%; when I use the libsvm binaries I always get 0%. Though now that I think of it, there may be some default settings that aren't the same between the two as I'm using them; I'll look into that.
I'll add code to write out the JSON and see if I can get it to you.
Alright, so there were some settings I think I was using your defaults for that I shouldn't have been. The new initialization code is here:

```javascript
var opts = {
  type: nodesvm.SvmTypes.C_SVC,
  kernel: nodesvm.KernelTypes.LINEAR,
  C: 1.0,
  probability: true,
  normalize: false,
  reduce: false,
  nFold: '???',
  cacheSize: '???'
};
```
When I run the libsvm binaries I don't use cross-validation (unless some kind of cross-validation is on by default); how do I turn it off?
What is the cacheSize setting? That sounds pretty relevant for a giant dataset like mine (not sure how I managed to overlook this).
As you mention, node-svm provides high-level features by default, such as grid search, PCA, and mean normalization (the libsvm binaries don't). To disable them (even if I'm not sure why you would want to), you have to use a lower-level interface (not documented right now). The initialization code becomes:
```javascript
var svm = new nodesvm.BaseSVM({
  type: nodesvm.SvmTypes.C_SVC,
  kernel: new nodesvm.LinearKernel(),
  C: 1.0,
  eps: 1e-3,      // stopping criterion
  cacheSize: 100, // in MB
  shrinking: 1,   // always use the shrinking heuristics
  probability: 0  // {0: false, 1: true}
});

svm.train(trainingDataset); // train svm synchronously using the training set

svm.evaluate(testDataset, function (result) { // evaluate the model's accuracy against the testing set
  console.log(JSON.stringify(result, null, 2));
});
```
Note: `-m cachesize` sets the cache memory size in MB (default 100). You should try greater values.

There's a bunch of reasons why I don't need the extra stuff.
I will switch to your BaseSVM class. Does this expose any of the classification progress events? I have seen them in some documentation, but I noticed that they aren't in the SVM class itself.
Is the default 100MB cache the same default that the libsvm binaries use?
The segmentation fault is not happening when I use the libsvm binaries, so I don't think it is on their end. Since I was able to get rid of the crash by using my system to simply write the same data to a file and call the libsvm binaries, it is unlikely that the crash is on my end either (though that certainly isn't ruled out); it is looking like something in your library. Hopefully removing the high-level interface will fix this.
BaseSVM does not expose any progress during training. The default cache size is 100 MB, the same as the libsvm binaries (you should try greater values).
Let me know if you find a solution for the seg fault error. BR
I have some good news for you: I don't think the issue is with your library anymore. I've been doing a lot of debugging on this lately and I found that some of the images are giving back feature vectors that are not 4096-dimensional, so I am now looking into the feature extractor to figure out why that is.
My guess is that it worked with the libsvm file because the format is sparse, so libsvm was automatically assuming 0 for the missing dimensions (which, for whatever reason, worked well). Since your library (rightly) requires the full feature vector, it was failing on the same data.
I will keep looking into this and let you know what I find.
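A quick sanity check for the dimensionality issue described above could look like this, assuming each example has the shape `[featureArray, label]` (a hypothetical layout — adjust to your dataset's actual structure):

```javascript
// Verify every feature vector has the expected dimensionality before
// handing the dataset to node-svm; mismatched lengths are exactly the
// kind of malformed input that can crash native code.
function findBadExamples(dataset, expectedDim) {
  var bad = [];
  dataset.forEach(function (example, i) {
    var features = example[0];
    if (!Array.isArray(features) || features.length !== expectedDim) {
      bad.push(i);
    }
  });
  return bad; // indices of malformed examples
}
```

Running this with `expectedDim = 4096` over the whole training set would flag the offending images up front instead of deep inside the training call.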
Tx for your feedback! Best regards
I'm getting a segmentation fault when attempting to train with a rather large dataset.
I have extracted overfeat features from the mnist dataset. When I try to train on a subset of this data using your library, things work fine. If I manually stick the full dataset into a text file in the libsvm format and run the libsvm executables for training, it also works fine. However, if I try to train on the full dataset using your library, I get a segmentation fault.
Any idea how I can help debug this and/or verify that I am doing things correctly?
Just for reference, each overfeat feature has 4,096 dimensions and there are 60,000 of them for the full mnist training set.
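For a rough sense of scale, storing that dataset as dense double-precision numbers (an assumption — the actual in-memory layout depends on the library) works out to:

```javascript
// 60,000 examples x 4,096 dimensions x 8 bytes per double
var bytes = 60000 * 4096 * 8;
console.log((bytes / Math.pow(1024, 3)).toFixed(2) + ' GiB'); // ~1.83 GiB
```

Plain JavaScript arrays carry significant per-element overhead on top of that, so the real footprint in Node could easily be several times larger, which makes an out-of-memory failure during training plausible.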