scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License
458 stars 121 forks source link

Forward pass adds same linear basis function #195

Open CherylCB opened 5 years ago

CherylCB commented 5 years ago

I'm trying to run a pyearth model with enable_pruning = False and only linear features with a max_degree of 1. I noticed that the same feature is added twice. See results below:

-------------------------------------------------------------------------
iter  parent  var  knot  mse            terms  gcv         rsq    grsq   
-------------------------------------------------------------------------
0     -       -    -     124390.433112  1      125669.504  0.000  0.000  
1     0       11   -1    58269.498543   2      60407.652   0.532  0.519  
2     0       12   -1    46309.037751   3      49280.000   0.628  0.608  
3     0       9    -1    43489.795470   4      47522.247   0.650  0.622  
4     0       10   -1    40593.709662   5      45564.586   0.674  0.637  
5     0       8    -1    38274.623612   6      44146.607   0.692  0.649  
6     0       1    -1    36957.981690   7      43820.303   0.703  0.651  
7     0       7    -1    34190.204492   8      41688.582   0.725  0.668  
8     0       12   -1    33603.237796   9      42151.901   0.730  0.665  
-------------------------------------------------------------------------

I noticed issue #135 which seems to suggest the same bug, addressed by @jcrudy . I'm installing from the latest commit git+https://github.com/scikit-learn-contrib/py-earth.git@b209d1916f051dbea5b142af25425df2de469c5a#egg=sklearn-contrib-py-earth

jcrudy commented 5 years ago

@CherylCB Thanks for reporting this. I will look into it as soon as I can, which might be a while. It's annoying, but if this bug is causing problems for you the best workaround might be to add your own code to prune any duplicate basis functions. If you want to do that and need guidance on where to start, please comment here and I can elaborate.

It seems you're using py-earth essentially as a variable selection mechanism for linear regression. Is that right? If so, you could also just extract the set of selected variables and use them with sklearn.LinearRegression.

CherylCB commented 5 years ago

Thanks for your answer @jcrudy. For now indeed I have added my own code to workaround the problem of adding duplicate basis functions.

jcrudy commented 5 years ago

@CherylCB Glad you were able to work around this bug. Please feel free to post code for your workaround in this thread if it's shareable. It might help someone out later. I'll be leaving this issue open until it's fixed.

CherylCB commented 5 years ago

@jcrudy I have some time coming days to work on this bug, do you have any suggestions on what would be the first place for me to look?

jcrudy commented 5 years ago

@CherylCB That's great. I'll take all the help I can get. Going off of the current master, I'd start by looking here. As you'll see, I wrote some special code to try to prevent exactly what you are seeing. One of two possible things is probably happening:

  1. The has_linear method in that line is not correct. The has_linear method is defined as part of the BasisFunction abstract class, which you'll find in _basis.pyx. It calls linear_in, which has different implementations for the different subclasses. Possibly something in this logic is wrong.

  2. The forward pass is not using variable_can_be_linear correctly in some specific situations. If this is what's happening, you'll have to go over the logic of the forward pass and see where the linear basis function is being added in your example, then figure out how to prevent it from happening. If you find any clues here they could be very helpful for me, even if you're not sure how to fully solve the problem.

You're actually in a good position to figure this out, since you have an example data set that shows the problem. I'd suggest you try to debug what's happening in the forward pass using a script that fits a model to your data set. Unfortunately, it's hard to set up a debugger to work with cython. Perhaps you're more skilled than me in this area, but if not I suggest you just use print statements in the cython files.

The workflow is something like this:

  1. Clone the py-earth repo
  2. cd into the repo and build the code using: python setup.py build_ext --inplace --cythonize
  3. cd back out of the repo. Put your test script in the same directory as the repo (not inside the repo)
  4. Make sure your python environment doesn't have py-earth installed.
  5. Now when you run your script, it should import pyearth from the repo.
  6. Whenever you change a cython file (such as to add print statements, for example), you'll need to run the command from step 2 again.

If you have any problems, don't hesitate to get in touch. You can reply here or email me (my address is on my github profile). You're potentially saving me a lot of time by working on this, so of course I'm very happy to spend some time helping you succeed at it. Good luck!