Segfault for large dataset

Queuecumber commented 10 years ago

I'm getting a segmentation fault when attempting to train with a rather large dataset.

I have extracted overfeat features from the mnist dataset. When I try to train on a subset of this data using your library, things work fine. If I manually stick the full dataset into a text file in the libsvm format and run the libsvm executables for training, it also works fine. However, if I try to train on the full dataset using your library, I get a segmentation fault.

Any idea how I can help debug this and/or verify that I am doing things correctly?

Just for reference, each overfeat feature has 4,096 dimensions and there are 60,000 of them for the full mnist training set.

nicolaspanel commented 10 years ago

Could you add some code to explain how do you:

initialze svm object and which options you use (Example : var svm = new XXX(options)...)
deserialize dataset from text file.

Debug ideas:

The error may be due to an invalid parsing happening only on few examples =>
- Split your text file into n subfiles (let say n = 10)
- Try to train each one with node-svm. one or more may fail.
May be a bug in dataset preprocessing (Normalisation and/or PCA) =>
- edit node_modules/node-svm/lib/svm.js file: before line 145, exports problem to json using fs.writeFileSync('temp.json', JSON.stringify(problem));
- look for invalid data such as NaN, null, undefined...

Note: for image processing you may use deep learning technics instead of SVM. Main libraries are Theano and caffe.

Queuecumber commented 10 years ago

svm initialization:

var opts = {
    type: nodesvm.SvmTypes.C_SVC,
    kernel: nodesvm.KernelTypes.LINEAR,
    C: 1.0,
    probability: true
};

var svmClassifier = new nodesvm.SVM(opts);

The dataset is cached in a redis server in JSON format rather than being stored in a text file. I will dump my labeled features and analyse them as well as training in groups of 10 and let you know the result when this is done.

Thanks for the suggestion, overfeat is in fact a convolutional network, I am stripping off the result at layer 19 to use as a feature for SVM training.

nicolaspanel commented 10 years ago

To keep things simple, you should also disable following options during debug :

PCA (reduce option)
Normalization (normalize option)
Probabilities (probability option)

var opts = {
    type: nodesvm.SvmTypes.C_SVC,
    kernel: nodesvm.KernelTypes.LINEAR,
    C: 1.0,
    probability: false,
    normalize: false,       
    reduce: false
};

Queuecumber commented 10 years ago

Ok, I'll try that, I'm currently running another round of feature extraction using a smaller encoding because I found that parts of the system were running out of memory. This will complete in around 30 hours so I'll know more then.

Do you think this could be caused by an out of memory error or would it say that instead of segmentation fault? I don't suppose there is a way to operate on the data in a zero copy manner?

nicolaspanel commented 10 years ago

It don't know if seg fault error is caused by an out of memory but it is possible.

Can you monitor memory on your server to confirm?

I suggest to check your dataset anyway.

Queuecumber commented 10 years ago

Just wanted to give you an update, I'm currently doing a lot of refactoring to ensure the integrity of my dataset before training, when I get that all done I'll let you know if I still see this problem

Queuecumber commented 10 years ago

Ok so I did a lot of refactoring and my memory usage is right about where I want it. During this process I was testing on some smaller datasets and found that the library crashes on a 5000 sample dataset. I was able to print this dataset out and train it successfully using the libsvm binaries. I am still playing with the sizes to find out exactly where the problems start, but other than that how do you want to proceed?

Queuecumber commented 10 years ago

More weirdness.

I added a module to my program to write out temp files and call the libsvm binaries so that I can start getting results out of this. When I'm testing the pipeline I use a couple smaller sets because they finish quickly. One of them is a 10 sample set the other is a 100 sample set. I found that when I run the 10 sample set using the libsvm binaries I get 0% testing accuracy and when I run it with your library I get 40% accuracy. When I run the 100 sample set they both get 75% accuracy. Any idea what might cause this?

By the way I don't think I ever posted this but the code is available here if you want to review it to see exactly what is happening.

nicolaspanel commented 10 years ago

Accuracy can vary significantly on small dataset due to:

n-fold cross validation
mean-normalization
PCA reduction

For consistency you should use dataset with more than 50 examples.

About segfault error,can you :

edit node_modules/node-svm/lib/svm.js file: before line 145, exports problem to json using fs.writeFileSync('temp.json', JSON.stringify(problem));
look for invalid data such as NaN, null, undefined...
send me the generated file (temp.json)

Queuecumber commented 10 years ago

The accuracy doesn't seem to vary though, when I use your library on the 10 sample set i always get 40%, when I use libsvm binaries i always get 0% though now that I think of it there may be some default settings that aren't the same between two as I'm using them, i'll get on that.

I'll add code to write out the JSON and see if I can get it to you

Queuecumber commented 10 years ago

Alright so there were some settings I think I was using your defaults for that I shouldn't have been. New initialization code is here

var opts = {
    type: nodesvm.SvmTypes.C_SVC,
    kernel: nodesvm.KernelTypes.LINEAR,
    C: 1.0,
    probability: true,
    normalize: false,
    reduce: false,
    nFold: '???',
    cacheSize: '???'
};

When I run the libsvm binaries I don't use cross validation (unless some kind of cross validation is on be default), how do I turn if off?

What is the cacheSize setting? That sounds pretty relevant for a giant dataset like i have (not sure how I managed to overlook this).

nicolaspanel commented 10 years ago

As you mention, node-svm provides high level stuffs by default such as grid search, PCA, mean-normalization (libsvm binaries don't).

To disable them (even if I'm not sure why you would like to do so), you should use a lower level interface (not documented right now). Initialization code become:

var svm = new nodesvm.BaseSVM({
      type: nodesvm.SvmTypes.C_SVC,
      kernel: new nodesvm.LinearKernel(),
      C: 1.0,
      eps: 1e-3, // stopping criteria 
      cacheSize: : 100,                 // in MB 
      shrinking   : 1, // always use the shrinking heuristics
      probability :  0 // {0: false, 1: true}
    });
svm.train(trainingDataset); // train svm  synchronously using training set 
svm.evaluate(testDataset, function(result){ // evaluate  model's  accuracy  against testing set
  console.log(JSON.stringify(result, null, 2);
});

Note:

BaseSVM object should have the exact same behavior than libsvm binaries.
Cache size refers to libsvm option:-m cachesize : set cache memory size in MB (default 100). You should try greater values.
See faq for more information about segfault error, more specifically :
- http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f210
- http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f416

Queuecumber commented 10 years ago

There's a bunch of reasons why I don't need the extra stuff

I have no use for normalization, overfeat is giving me comparable feature vectors for each image
I have no use for PCA, I have no reason to believe that the data is rotated, even if it was rotated, aligning it based on the directions of maximum variance is not guaranteed to produce good classification results, reducing dimensionality may speed up the classification process and use less memory but this would hurt the accuracy
I don't need a grid search, the only parameter I would be changing is C since I am using a linear svm and I don't expect this to effect my result as the features are already very much linearly separable (I think a perceptron would even perform well on this data)

I will switch to your basesvm class. Does this expose any of the classification progress events? I have seen them in some documentation but I noticed that they aren't in the SVM class itself.

Is the default 100MB cache the same default that the libsvm binaries use?

The segmentation fault is not happening when I use the libsvm binaries so I don't think it is on their end. Since I was able to get rid of the crash using the my system and simply writing the same data to a file and calling the libsvm binaries, it is unlikely that the crash is on my end either (though it certainly isnt ruled out), it is looking like something in your library. Hopefully removing the high level interface will fix this.

nicolaspanel commented 10 years ago

BaseSVM do not expose any progress during training. The default cache size is 100MB, same as libsvm binaries (you should try greater values).

Let me know if you find a solution for the seg fault error. BR

Queuecumber commented 9 years ago

I have some good news for you, I don't think the issue is with your library anymore. I've been doing a lot of debugging in this lately and I managed to find that some of the images are giving back feature vectors that are not 4096 dimensional, so I am now looking into the feature extractor to figure out why that is.

My guess is that it works for the libsvm file because the format is sparse so it was automatically assuming 0 for the missing dimensions (which for whatever reason worked well). Since your library (rightly) requires the full feature vector, it was failing for the same data.

I will keep looking into this and let you know what I find.

nicolaspanel commented 9 years ago

Tx for your feedback! Best regards

nicolaspanel / node-svm

Segfault for large dataset #6