
Program crash on data loaded by load_svmlight_file #479

Closed fannix closed 12 years ago

fannix commented 12 years ago

I load the dataset with load_svmlight_file()

Xc, yc = load_svmlight_file("../ch.vec", ch_feature)

and then fit it with

nb = MultinomialNB()
nb.fit(Xc, yc)

The program then crashed; no exception or warning message was displayed.

This problem is caused by an off-by-one error: my feature indices start from 1, while load_svmlight_file() requires them to start from 0. However, this failure mode is quite confusing and very hard to debug.

GaelVaroquaux commented 12 years ago

The right way to deal with this is probably to add a check for this error, and raise an Exception.

Of course this is a generic answer, and I haven't looked at the code, so I don't know how the check should be implemented :)
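
A minimal sketch of the kind of check being suggested, at the Python level (the function and variable names here are illustrative, not the actual loader code):

import numpy as np

def check_feature_indices(indices, n_features):
    # `indices` is assumed to hold the column indices parsed from the
    # svmlight file, `n_features` the dimensionality the caller asked for.
    indices = np.asarray(indices)
    if indices.size and indices.max() >= n_features:
        raise ValueError(
            "Feature index %d is out of range for n_features=%d"
            % (indices.max(), n_features))

Raising here would turn the silent construction of an out-of-bounds sparse matrix into an immediate, explicit error.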

mblondel commented 12 years ago

Can you paste here a minimal dataset that reproduces the problem? The reason indices start at 1 is to allow a dummy feature at index 0 (to replace the bias).

ogrisel commented 12 years ago

Is the crash a segmentation fault triggered by the call to MultinomialNB.fit? I don't understand how that can happen.

fannix commented 12 years ago

Yes. The segmentation fault happens in the fit() method. I don't know if it is related, but my working environment is the Enthought distribution on Windows. I haven't tried other platforms.


fannix commented 12 years ago

OK, I'll try to reproduce the error later. I am still not sure if it is platform-dependent.


GaelVaroquaux commented 12 years ago


That's probably because during the fit() method some code is trying to access memory that has been freed by mistake.


fannix commented 12 years ago

This is the minimal dataset, en.vec:

-1 488:1 18:1 46:1 248:1 547:1 5665:1 648:1 40:2 44:1 6:1 48:1 1873:1 4958:1 4279:1 12:1 88:1 2:3 445:1 16:1 1243:1 1237:1

And this is the script that crashes Python:

import sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("d:/en.vec", 4957)
nb = MultinomialNB()
nb.fit(X, y)
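
As a side note (my own diagnostic sketch, not part of the original report): a quick way to spot this kind of mismatch is to compare the largest feature index in the file with the n_features argument.

max_index = 0
with open("d:/en.vec") as f:
    for line in f:
        for token in line.split()[1:]:  # skip the label
            max_index = max(max_index, int(token.split(":")[0]))
print(max_index)  # 5665 for the line above, well beyond the requested 4957
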
larsmans commented 12 years ago

I don't get a segfault for this at all. I do get a different error due to there being only one class in the input.

@fannix: could you try this again with the current master? We did fix a bug in the SVMlight loader yesterday.

fannix commented 12 years ago

I'll try that on Monday. However, I think it is reasonable to raise an exception in such cases, since incompatible dimensions will surely lead to errors later on.

fannix commented 12 years ago

The problem still exists. However, the behavior is not deterministic: for some data it sometimes crashed and sometimes didn't, and I couldn't find a pattern.

ogrisel commented 12 years ago

I ran the data load + fit sequence 100 times in a loop. I don't get any segfault, but for each fit I get:

Warning: divide by zero encountered in log

larsmans commented 12 years ago

I get the division by zero message too. That's caused by trying to fit an NB estimator to a single sample/label.

fannix commented 12 years ago

I guess this problem is platform-dependent. I don't get a segmentation fault on Mac OS X either.

amueller commented 12 years ago

Can anyone reproduce this problem?

larsmans commented 12 years ago

@fannix, can you try again with the latest version?

buma commented 11 years ago

I get a segmentation fault with the example that was posted. I'm using Arch Linux, kernel 3.4.7-1-pae, 32-bit, with scikit-learn version 0.11-2.

Sometimes I get a segfault and sometimes a stack trace which starts like this:

/usr/lib/python2.7/site-packages/sklearn/naive_bayes.py:269: RuntimeWarning: divide by zero encountered in log
  self.class_log_prior_ = np.log(y_freq) - np.log(y_freq.sum())
*** glibc detected *** python2: double free or corruption (!prev): 0x09af7dd0 ***
======= Backtrace: =========
/lib/libc.so.6(+0x72702)[0xb74b1702]
/usr/lib/python2.7/site-packages/numpy/core/multiarray.so(+0x8ff8b)[0xb7045f8b]
/usr/lib/python2.7/site-packages/numpy/core/multiarray.so(+0x8ffd9)[0xb7045fd9]
/lib/libpython2.7.so.1.0(+0x512cc)[0xb76512cc]
/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x529)[0xb76c5d29]
/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4d84)[0xb76c4234]
/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x83d)[0xb76c603d]
/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x63)[0xb76c61b3]
/lib/libpython2.7.so.1.0(+0xdeeca)[0xb76deeca]
/lib/libpython2.7.so.1.0(PyRun_FileExFlags+0x9b)[0xb76dfe6b]
/lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xe1)[0xb76e09e1]
/lib/libpython2.7.so.1.0(PyRun_AnyFileExFlags+0x88)[0xb76e1658]
/lib/libpython2.7.so.1.0(Py_Main+0xd22)[0xb76f2932]
python2(main+0x27)[0x8048587]
/lib/libc.so.6(__libc_start_main+0xf5)[0xb7458605]

It crashes inside fit, on the first line of the statement: self.feature_log_prob_ = (np.log(N_c_i + self.alpha)

mblondel commented 11 years ago

Can you try with the git master version of scikit-learn?

buma commented 11 years ago

I installed the master version from git and it's the same problem. I can also try valgrind or other things if that might help.

I'm working on something where part of the training set was fine, but with the full training set I get a similar error. Inside fit I can compute N_c_i + self.alpha; it crashes when I compute np.log(N_c_i + self.alpha). I tried saving the variables with pickle to see where the problem is, but pickle runs out of memory (I have 8 GB). The full pickled training set is an 800 MB sparse array of shape 175,315 x 592,158.

I am using numpy 1.6.2-1. I think the problem might be in numpy.
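
For what it's worth, here is a small illustration (my own, of the general mechanism rather than this exact code path) of why the crash can end up deep inside scipy rather than in the loader: sparse matrices do not, by default, validate that their column indices are in range, so a bad index from the file only blows up later inside compiled routines such as csc_matvecs.

import numpy as np
import scipy.sparse as sp

# A CSR matrix declared with 3 columns but carrying column index 5.
data = np.array([1.0])
indices = np.array([5])
indptr = np.array([0, 1])
X = sp.csr_matrix((data, indices, indptr), shape=(1, 3))  # accepted silently
# Operations on X may now read or write past internal buffers; an explicit
# full check catches the bad index up front instead:
X.check_format(full_check=True)  # raises ValueError for the out-of-range index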

buma commented 11 years ago

I have run valgrind with the suggested suppressions file on the example:

==22029== Memcheck, a memory error detector
==22029== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==22029== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==22029== Command: python2 -E -tt test.py
==22029== 
==22029== Invalid read of size 8
==22029==    at 0x6EC52DD: void csc_matvecs<int, double>(int, int, int, int const*, int const*, double const*, double const*, double*) (in /usr/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csc.so)
==22029==    by 0x6E7B219: ??? (in /usr/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csc.so)
==22029==    by 0x6E975BF: ??? (in /usr/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csc.so)
==22029==    by 0x40CB9C5: PyCFunction_Call (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x41280EC: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x41290FA: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x40B6A2F: function_call (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x409F8AB: instancemethod_call (in /usr/lib/libpython2.7.so.1.0)
==22029==  Address 0x74f8e38 is not stack'd, malloc'd or (recently) free'd
==22029== 
==22029== Invalid write of size 8
==22029==    at 0x6EC52E0: void csc_matvecs<int, double>(int, int, int, int const*, int const*, double const*, double const*, double*) (in /usr/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csc.so)
==22029==    by 0x6E7B219: ??? (in /usr/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csc.so)
==22029==    by 0x6E975BF: ??? (in /usr/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csc.so)
==22029==    by 0x40CB9C5: PyCFunction_Call (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x41280EC: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x41290FA: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x40B6A2F: function_call (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==22029==    by 0x409F8AB: instancemethod_call (in /usr/lib/libpython2.7.so.1.0)
==22029==  Address 0x74f8e38 is not stack'd, malloc'd or (recently) free'd
==22029== 
==22029== 
==22029== HEAP SUMMARY:
==22029==     in use at exit: 8,508,788 bytes in 6,010 blocks
==22029==   total heap usage: 176,198 allocs, 170,188 frees, 536,218,183 bytes allocated
==22029== 
==22029== LEAK SUMMARY:
==22029==    definitely lost: 0 bytes in 0 blocks
==22029==    indirectly lost: 0 bytes in 0 blocks
==22029==      possibly lost: 194,597 bytes in 275 blocks
==22029==    still reachable: 8,314,175 bytes in 5,734 blocks
==22029==         suppressed: 16 bytes in 1 blocks
==22029== Rerun with --leak-check=full to see details of leaked memory
==22029== 
==22029== For counts of detected and suppressed errors, rerun with: -v
==22029== ERROR SUMMARY: 8 errors from 2 contexts (suppressed: 16431 from 1190)

buma commented 11 years ago

I compiled scipy from git and get the same errors and the same valgrind dump:

==5101== Invalid read of size 8
==5101==    at 0x6E78496: _wrap_csc_matvecs__SWIG_10.isra.48 (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csc.so)
==5101==    by 0x6E9D18F: _wrap_csc_matvecs (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csc.so)
==5101==    by 0x40CB9C5: PyCFunction_Call (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x41280EC: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x41290FA: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x40B6A2F: function_call (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x409F8AB: instancemethod_call (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==5101==  Address 0x4a0a280 is not stack'd, malloc'd or (recently) free'd
==5101==
==5101== Invalid write of size 8
==5101==    at 0x6E78499: _wrap_csc_matvecs__SWIG_10.isra.48 (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csc.so)
==5101==    by 0x6E9D18F: _wrap_csc_matvecs (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csc.so)
==5101==    by 0x40CB9C5: PyCFunction_Call (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x41280EC: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x41290FA: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x40B6A2F: function_call (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x409F8AB: instancemethod_call (in /usr/lib/libpython2.7.so.1.0)
==5101==    by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==5101==  Address 0x4a0a280 is not stack'd, malloc'd or (recently) free'd
==5101==
==5101==
==5101== HEAP SUMMARY:
==5101==     in use at exit: 8,324,020 bytes in 6,100 blocks
==5101==   total heap usage: 163,188 allocs, 157,088 frees, 482,700,798 bytes allocated
==5101==
==5101== LEAK SUMMARY:
==5101==    definitely lost: 0 bytes in 0 blocks
==5101==    indirectly lost: 0 bytes in 0 blocks
==5101==      possibly lost: 196,725 bytes in 281 blocks
==5101==    still reachable: 8,127,279 bytes in 5,818 blocks
==5101==         suppressed: 16 bytes in 1 blocks
==5101== Rerun with --leak-check=full to see details of leaked memory
==5101==
==5101== For counts of detected and suppressed errors, rerun with: -v
==5101== ERROR SUMMARY: 8 errors from 2 contexts (suppressed: 14715 from 1061)

mblondel commented 11 years ago

I can reproduce with the example given in https://github.com/scikit-learn/scikit-learn/issues/479#issuecomment-3229270

mblondel commented 11 years ago

There are two problems in https://github.com/scikit-learn/scikit-learn/issues/479#issuecomment-3229270.

First, the indices in the dataset are not sorted. Our implementation, svmlight, and libsvm all assume that the features are sorted. In our implementation, in the inner loop, we could check that the index at position t is always greater than the index at position t-1 and raise an exception if needed. I'm not sure whether this would incur a performance penalty.

Second, the explicitly-passed dimensionality should be 4958, not 4957. (But this is not what causes the segfault)
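
A rough sketch of the monotonicity check described in the first point, written in plain Python for readability (the real loader is Cython, and the names here are made up):

def parse_features(tokens):
    # tokens are the 'index:value' pairs of one svmlight line, label excluded
    prev_idx = -1
    indices, values = [], []
    for token in tokens:
        idx_str, value_str = token.split(":")
        idx = int(idx_str)
        if idx <= prev_idx:
            raise ValueError("Feature indices must be sorted and unique; "
                             "got %d after %d" % (idx, prev_idx))
        indices.append(idx)
        values.append(float(value_str))
        prev_idx = idx
    return indices, values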

mblondel commented 11 years ago

Pushed a fix in 6f54691eb3783bc8ff875201f7d32fb2cc457100.

@fannix @buma If you feel like it, a PR adding a sort parameter to load_svmlight_file and load_svmlight_files (defaulting to False) would be nice.

In the multi-label case, I see that the labels are sorted. It may be better to sort them only if the above option is set to True.
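
One possible shape for such an option, sketched at the Python level (the parameter name and placement are only a guess at what the PR could look like); scipy's CSR matrices already provide the per-row sort:

import scipy.sparse as sp

def sort_loaded_matrix(X):
    # Roughly what a sort=True option could do after parsing: sort the
    # column indices (and the matching data) within each row of the result.
    X = sp.csr_matrix(X, copy=True)
    X.sort_indices()
    return X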

buma commented 11 years ago

Thank you very much for the fix. I'll try to add the parameter, but I am currently only a user.

mblondel commented 11 years ago

Can you confirm that it fixed your problem? If so, I would suggest closing this issue.

buma commented 11 years ago

Sadly, it didn't fix my problem. At first it seemed like the same error, but I have now figured out that my problem gives a different stack trace. I'll open a new issue for it; I'm working on a minimal dataset that triggers the error. This issue can definitely be closed.

buma commented 11 years ago

I created an example and figured out that my biggest problem (cross-validation only works for 3 folds, not more) is fixed if I use the highest pickle protocol when writing the data I needed.

The other problems are that chi2 gives a memory error and cross-validation doesn't work with multiple jobs. The problem is that the test data is big, 260 MB. I'm uploading it to Dropbox right now.

amueller commented 11 years ago

@buma It seems you have some memory issues. Could you say what exactly your problem is?

buma commented 11 years ago

I have finally succeeded in creating example that also segfaults.

The problem is that the program segfaults if I change cv to 5 here:

scores = cross_val_score(classifier, X_train, y_train, cv=5, n_jobs=1)

primer.py X_all.pickle y.pickle

Valgrind dump:

==32594== Warning: set address range perms: large range [0x18851018, 0x28851038) (noaccess)
==32594== Warning: set address range perms: large range [0x8c4b028, 0x1c3252a0) (undefined)
==32594== Warning: set address range perms: large range [0x1c326028, 0x2fa002b1) (undefined)
==32594== Warning: set address range perms: large range [0x1c326018, 0x2fa002c1) (noaccess)
==32594== Warning: set address range perms: large range [0x8c4b018, 0x1c3252b0) (noaccess)
==32594== Warning: set address range perms: large range [0x8c4b028, 0x2726235c) (undefined)
==32594== Warning: set address range perms: large range [0x76b9f028, 0x951b636d) (undefined)
==32594== Warning: set address range perms: large range [0x76b9f018, 0x951b637d) (noaccess)
==32594== Warning: set address range perms: large range [0x8c4b018, 0x2726236c) (noaccess)
==32594== Warning: set address range perms: large range [0x38d41018, 0x58d41038) (noaccess)
Loaded X
Loaded y
SPlit
==32594== Warning: set address range perms: large range [0x7389b028, 0x8eed5318) (undefined)
==32594== Invalid read of size 4
==32594==    at 0x4F3960B: trivial_three_operand_loop (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==32594==    by 0x4F4EEF7: PyUFunc_GenericFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==32594==    by 0x4F4F1BA: ufunc_generic_call (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==32594==    by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x4090347: call_function_tail (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x409045F: _PyObject_CallFunction_SizeT (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x4E67B80: PyArray_GenericBinaryFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==32594==    by 0x408BE25: binary_op1 (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x408DC35: PyNumber_Add (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x4125781: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==32594==  Address 0x1c is not stack'd, malloc'd or (recently) free'd
==32594==
==32594==
==32594== Process terminating with default action of signal 11 (SIGSEGV)
==32594==  Access not within mapped region at address 0x1C
==32594==    at 0x4F3960B: trivial_three_operand_loop (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==32594==    by 0x4F4EEF7: PyUFunc_GenericFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==32594==    by 0x4F4F1BA: ufunc_generic_call (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==32594==    by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x4090347: call_function_tail (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x409045F: _PyObject_CallFunction_SizeT (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x4E67B80: PyArray_GenericBinaryFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==32594==    by 0x408BE25: binary_op1 (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x408DC35: PyNumber_Add (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x4125781: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==32594==    by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)

This is with yesterday's master of scikit-learn, scipy, and numpy.

mblondel commented 11 years ago

Is it related to load_svmlight_file or not? If yes, you need to provide the svmlight file, not the pickle file.

In any case, please create a separate issue.

buma commented 11 years ago

I thought it was related, but valgrind showed a different problem.

I have created separate issue: https://github.com/scikit-learn/scikit-learn/issues/998

mblondel commented 11 years ago

@buma: Thanks.