scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License
457 stars 122 forks source link

Examples failing, or Not Compatible with Windows #36

Closed Jenders74 closed 10 years ago

Jenders74 commented 11 years ago

Both of the usage examples result in the same error for me..

import numpy import cProfile from pyearth import Earth from matplotlib import pyplot

numpy.random.seed(2) m = 1000 n = 10 X = 80_numpy.random.uniform(size=(m,n)) - 40 y = numpy.abs(X[:,6] - 4.0) + 1_numpy.random.normal(size=m) model = Earth(max_degree = 1) model.fit(X,y) Traceback (most recent call last): File "", line 1, in File "C:\Python27\lib\site-packages\pyearth\earth.py", line 312, in fit self.forward_pass(X, y) File "C:\Python27\lib\site-packages\pyearth\earth.py", line 383, in forward_pass forward_passer = ForwardPasser(X, y, **args) File "_forward.pyx", line 67, in pyearth._forward.ForwardPasser.init (pyearth/_forward.c:3146) File "_forward.pyx", line 96, in pyearth._forward.ForwardPasser.init_linear_variables (pyearth/_forward.c:3698) ValueError: Buffer dtype mismatch, expected 'INT_t' but got 'long long'

jcrudy commented 11 years ago

I am guessing the problem is that numpy on your system uses a different data type for integers than the numpy on my system (and, so far, all other users' systems). Would you mind sending me your system information including:

  1. operating system (and is it 32 or 64 bit)
  2. python version (and is it 32 or 64 bit)
  3. numpy version
  4. anything else you think is relevant (such as fancy BLAS implementations or something, if you're aware of them)

I am no expert in the numpy data types, so I will have to do some research. I would like to come up with a solution that works on all systems.

Right now, you may be able to get it to work by doing the following:

  1. delete your current py-earth installation
  2. install cython
  3. reinstall py-earth with the --cythonize argument: sudo python setup.py install --cythonize
  4. try the example again

If you try that, would you please let me know if it works or not? Thanks so much for reporting this. I had suspected that numpy data types might be a problem on some systems, but wasn't really sure. Now that I have an example of the problem I can work on a solution.

jcrudy commented 11 years ago

After some research, I made a change that I think might fix your problem. Unfortunately I can't test it because it has always worked on all of my systems (mac and 64 bit linux). I believe this problem should be fixed by commit ae663871879b28f453ccde189c06194c5f540981, so I am going to close this issue for now. If you test it out and still have the same problem, please reopen it. Thanks!

Jenders74 commented 11 years ago

Thank you for making the changes. Tried uninstalling the module and re-installing the new master. Also, followed the cython instructions. Totally possible that I messed up an uninstall/re-install but I get the same dtype issue. on my 64 bit Windows 7 machine with Python 2.7.3 -- EPD 7.3-2 (64-bit) and 1.6.1 numpy.

jcrudy commented 11 years ago

I have never tested on Windows, so I'm guessing that's the problem. I'll try to get my hands on a Windows machine this weekend and do some tests. Until then, I'm afraid the best advice I can give you is to either use the earth package in R or use Linux. I'll report here as soon as I've figured out the problem.

BTW, I currently use travis for automated testing. If anyone who reads this knows of a similar service capable of testing in a Windows environment, could you please comment?

rkern commented 11 years ago

The code that is failing is assigning the result of np.argsort() to an INT_t array. np.argsort() returns an array with the type cnp.intp_t (in Cython terms), which corresponds to C's ssize_t, the size of a pointer offset. On 64-bit Windows, this is 64-bits, naturally. However, the size of a cnp.int_t or INT_t (which corresponds to a C long) is still 32-bits (unlike 64-bit Linux and OS X which promote long to 64-bits).

The order array in that method needs to be declared with a type cnp.intp_t instead.

rkern commented 11 years ago

I can confirm that changing INT_t to cnp.intp_t in the declaration of order lets me run both examples.

There are a few other things that need to fixed to get it to compile using Visual Studio on Windows. Namely, Visual Studio is not a C99 compiler, so log2() is not implemented and must be emulated. Also, Visual Studio errors out at compile time at the expression 0 / 0. I just replaced that with np.nan.

jcrudy commented 11 years ago

@rkern, thanks for chiming in! You just saved me at least an hour of dtype guess and check plus a 90 minute drive to my parents' house to use their computer (I should still visit soon, though). I made the changes you suggested and, barring unexpected things, this issue should be repaired as of commit c5385b0f1d736bcf6de75a87dd92c40923bbc015.

I'm closing this issue for now. @Jenders74, if you still have the problem please comment and reopen.

jcrudy commented 10 years ago

I just got an email from a 64 bit Windows user having type problems. It appears that the previous fix did not work for all environments. The following is pasted from the email:

While I was able to use it on my old 32bit Windows machine, it fails to install on my newer 64bit Windows machine.

The compiling of PYX fials with
      ValueError: Buffer dtype mismatch, expected 'INT_t' but got 'long'

It fails on the line in forward.pyx
        cdef cnp.ndarray[INT_t, ndim = 1] linear_variables = <cnp.ndarray[INT_t, ndim = 1] > self.linear_variables
        cdef cnp.ndarray[FLOAT_t, ndim = 2] B = <cnp.ndarray[FLOAT_t, ndim = 2] > self.B
        cdef cnp.ndarray[FLOAT_t, ndim = 2] X = <cnp.ndarray[FLOAT_t, ndim = 2] > self.X

I emailed back for more environment details, and I'll try to reproduce. Reopening this issue until fixed.

jcrudy commented 10 years ago

I've managed to reproduce this bug on EC2 with windows server 2012 using winpython 2.7 64 bit with visual studio 2013 professional. That's a start!

smrtslckr commented 10 years ago

Hello All! First off, thanks jcrudy for taking your time to port this to Python. It is one of the only reasons I find myself heading back to R, since it such a useful method. I just wanted to report that I am experiencing this issue on a Windows 7 64-bit machine myself running Anaconda 2.0 (2.7.6). In any case, if you have an questions about my system, so I can help just let me know. Hopefully, this will check some of the last boxes, so it can be merged with scikit-learn and off your maintainer plate.

jcrudy commented 10 years ago

@Erstwild It was my reason for heading back to R also! I'm glad it's useful to you. Thanks for the info. I'm not sure if it matters, but are you using visual studio, mingw, or something else?

jcrudy commented 10 years ago

I think I've fixed it. The fix is in the issue36 branch. I'd very much appreciate reports on whether it works or not from @Erstwild and anyone else who has experienced problems.

smrtslckr commented 10 years ago

@jcrudy Thanks Jason! Much appreciated. I will let you know if the new branch works in my environments when I get home. I mostly just use ipython notebooks for my work, but I occasionally use PyCharm as well. There is still a wealth of stuff on CRAN in very specific domains that trumps available Python libraries, but when I recently updated rpy2 to the new version, which is apparently only Python 3 compatible now, I have decided to make a concerted effort to start avoiding R, at least in the interim, while I transition some things.

smrtslckr commented 10 years ago

@jcrudy Apologies from the git noob...how would I clone your new branch to evaluate?

jcrudy commented 10 years ago

@Erstwild When you clone a git repository you actually get all the branches. Once you clone the repo, switch to the branch with:

git checkout issue36

smrtslckr commented 10 years ago

@jcrudy Hmm, that is what I did. It returned "Branch issue36 set up to track remote branch issue36 from origin. Switched to a new branch 'issue36'" . I think I was successful. When I go to rerun the example I still get the "ValueError: Buffer dtype mismatch, expected 'INT_t' but got 'long' "

jcrudy commented 10 years ago

@Erstwild You can tell if you are in the correct branch by looking for a file named _types.pyx in the pyearth directory. That file only exists in the issue36 branch, so if it's there you are in the correct branch. Assuming it's there, did you rebuild and reinstall after switching to the branch? You can do so with:

python setup.py build_ext python setup.py install

Just to be safe, you could try deleting py-earth entirely from your system, then clone the repo again, switch to the branch, and reinstall. If you did all that and it still doesn't work, then I guess my fix didn't work on your system. In that case, I would ask for all the details you can provide: CPU, compiler version, anything else you can think of. I think it should work, though, as it worked on my test system and on the system of the person who originally reported this bug.

jcrudy commented 10 years ago

I merged the fix for this issue into master. It works on my test machine, and I have confirmation that it works for the user who originally reported the bug. If @Erstwild or anyone else still has problems, please comment and reopen.