scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License
457 stars 122 forks source link

Supporting multi-column regression #80

Closed mehdidc closed 9 years ago

mehdidc commented 9 years ago

PR for issue #49.

To recall what I said in #49: In the earth R package (Check section 2.9, http://www.milbo.org/doc/earth-notes.pdf) , it is one model for each column of Y except that that they share the same set of basis functions and the models are optimized to minimize the overall GCV (the sum of GCVs of all the models). However they don't share the basis functions coeficients, each model has its own set of coeficients for basis functions.

Example of fitting two sine functions here : http://imgur.com/cLlwpb9

I modified the code in pyearth/_forward.pyx and pyearth/_pruning.pyx allowing multiple outputs but keeping the same "semantics".

Some modifications that have been done as a consquence of this:

jcrudy commented 9 years ago

@mehdidc, this is an amazing PR. I've been wanting this feature for a while. I found one bug, which I'll comment on in the code. The other thing is I'm curious how it affects speed. Have you done any performance comparisons? I can do some on my own, but if you have results already then that would be easier. The last thing is about sample_weight. Eventually it would be nice to have sample_weight be the same dimensions as y (so that different columns in y can be weighted differently). I would like to merge before worrying about that last thing, though.

mehdidc commented 9 years ago

Indeed the indices are reversed, thank you for noticing the bug ! I added support for 2D weights. In cases where only the basis B and the examples X are involved (e.g in pyearth/_forward.pyx line 71), the average of the weights over the outputs dimension of each example is used.

No, I didn't do speed tests yet, but it is something I want to do !

mehdidc commented 9 years ago

Some plots here of how the time duration scales with the number of outputs:

The wiggles are probably due the fact that I enabled pruning but in principle the duration should increase monotonically, will retest without it.

First plot, the number of outputs grows, everything else fixed :

Imgur

Second plot, nb of outputs and nb of examples both grow, everything else fixed:

Imgur

Third plot, nb of outputs and nb of input features both grow, everything else fixed:

Imgur

Fourth plot, nb of outputs and max nb of terms both grow, everything else fixed: Imgur

Source code here : https://gist.github.com/mehdidc/4fae166cf492590adedd72c9bd5628b5 or http://pastebin.com/NzQU5dFD

jcrudy commented 9 years ago

@mehdidc You wrote:

I added support for 2D weights. In cases where only the basis B and the examples X are involved (e.g in pyearth/_forward.pyx line 71), the average of the weights over the outputs dimension of each example is used.

Can you explain the math of why this works in a little more detail? I think this may not be correct, but I'm happy to be proved wrong.

Thanks for the timing! I don't understand the axes for the plots where multiple things are growing at once, but it looks linear in number of outputs, which makes sense to me. I'd also like to see how speed compares to the current master branch on single output problems. Perhaps a plot with two lines (one for master and one for this PR) and problem size (rows or columns) as the x-axis.

mehdidc commented 9 years ago

Indeed it is not correct, thank you for noticing, I am working on it, we actually need one basis space matrix per output, so B in "pyearth/_forward.pyx" will be transformed into a 3D tensor (an additional dimension for the outputs).

jcrudy commented 9 years ago

That matches my own understanding of it, and seems like the best way forward.

Technically, you could get away with just having a single B matrix, with a separate B_orth matrix for each column of the weight matrix. The orthonormal_update and best_knot code would have to be changed to take weight into account. I'm not advocating this approach, though, as it would be a lot more work to implement. Currently, weight is implicitly accounted for because it is multiplied into B, and while having a single B would save some memory, I think it's maybe not worth the effort.

jcrudy commented 9 years ago

BTW, in the R package earth, Stephen Milborrow gets around this problem by only allowing a separable weight matrix. That is, there are row weights and (output) column weights, so the resulting weight matrix is basically an outer product of the two. That way no additional copy of B or B_orth is needed because they would all be simply scalar multiples of each other.

mehdidc commented 9 years ago

Yeah with 2D (full) sample weights the is getting a bit more complicated, I have some bugs for now. I am in favor of this solution, it would be easier to implement and to check. What about doing like in the R package for now and update it to a full matrix of sample weights if needed ?

jcrudy commented 9 years ago

@mehdidc Yes, that sounds good to me. Eventually it would be nice to have specialized code for both separable and non-separable weights, but it's a lot of work. I see no reason not to start with the separable case, which still covers a lot of use cases (including all the ones supported by earth in R).

HJarausch commented 9 years ago

Hi, please change map(..) to list(map(..)) in the example above - this is necessary for Python3.x

Since this example takes several minutes to complete, I wonder if there are any plans to use multiple cores?

Thanks, Helmut

jcrudy commented 9 years ago

@HJarausch Thanks for that fix.

If anyone is trying to use @mehdidc's linked example above from Python 3, please see @HJarausch's comment for a necessary fix.

please change map(..) to list(map(..)) in the example above - this is necessary for Python3.x

Regarding using multiple cores, it's something I've looked into. The forward pass can definitely benefit from multiple cores. However, it's a lot of work to do, so I'm not sure exactly when it will happen. There are still some single core optimizations that haven't yet been implemented, such as "fast MARS", which would be easier and might happen sooner.

HJarausch commented 9 years ago

I have difficulties to use the multi-column version. I tried to fit a disturbed ellipse in 2-D space as follows (with the usal imports) Num= 20
X = np.linspace(0,2_pi,Num).reshape(Num,1) XX= np.linspace(0,2_pi,2_Num).reshape(2_Num,1)
C = 2
delta=0.1 u=C_np.cos(X) v=np.sin(X)
y = np.column_stack((u+delta_np.random.uniform(size=(Num,1)),
v+delta*np.random.uniform(size=(Num,1))) ) model = Earth(max_degree=3,max_terms=Num//2)
model.fit(X, y)
y_hat= model.predict(XX)

But y_hat is complete different from an ellipse. What am I missing?

Many thanks for a hint, Helmut

jcrudy commented 9 years ago

The problem is that with such a small number of points, py-earth is approximating your sine function as linear. You could maybe get the effect you want by reducing the penalty parameter or setting allow_linear to False. I just increased the number of points and it worked fine. Here's my code:

import numpy as np
from pyearth import Earth
from matplotlib import pyplot

Num= 40
pi = 3.14159

X = np.linspace(0,2*pi,Num).reshape(Num,1)
XX= np.linspace(0,2*pi,2*Num).reshape(2*Num,1)

C = 2

delta=0.1
u=C*np.cos(X)
v=np.sin(X)

y = np.column_stack((u+delta*np.random.uniform(size=(Num,1)), 
                     v+delta*np.random.uniform(size=(Num,1))))

model = Earth(max_degree=3,max_terms=Num//2)

model.fit(X, y)
y_hat= model.predict(XX)

pyplot.plot(XX, y_hat[:,0])
pyplot.plot(X, y[:,0], hold=True)
pyplot.show()

pyplot.plot(XX, y_hat[:,1])
pyplot.plot(X, y[:,1], hold=True)
pyplot.show()
HJarausch commented 9 years ago

On 08/05/2015 07:55:04 PM, Jason Rudy wrote:

The problem is that with such a small number of points, py-earth is approximating your sine function as linear. You could maybe get the effect you want by reducing the penalty parameter or setting allow_linear to False. I just increased the number of points and it worked fine. Here's my code:

import numpy as np
from pyearth import Earth
from matplotlib import pyplot

Num= 40
pi = 3.14159

X = np.linspace(0,2*pi,Num).reshape(Num,1)
XX= np.linspace(0,2*pi,2*Num).reshape(2*Num,1)

C = 2

delta=0.1
u=C*np.cos(X)
v=np.sin(X)

y = np.column_stack((u+delta*np.random.uniform(size=(Num,1)),
                     v+delta*np.random.uniform(size=(Num,1))))

model = Earth(max_degree=3,max_terms=Num//2)

model.fit(X, y)
y_hat= model.predict(XX)

pyplot.plot(XX, y_hat[:,0])
pyplot.plot(X, y[:,0], hold=True)
pyplot.show()

pyplot.plot(XX, y_hat[:,1])
pyplot.plot(X, y[:,1], hold=True)
pyplot.show()

Many thanks Jason. Still, I don't understand what's going on. Even with allow_linear=False and penalty=1, the sine is still approximated linear. But why, the approximation error is really large?

P.S. Perhaps I could help in this project. I'm a mathematicians having worked in numerical analysis and I have some years experience in programming, especially in C++ and Python.

Helmut

HJarausch commented 9 years ago

I just noted that the enable_pruning paramter isn't accepted anymore. Is this by intention?

jcrudy commented 9 years ago

@HJarausch I just played around with it a little, and it looks like you can actually get a better result with 20 points if you set the penalty=1 and minspan=1. Regarding the enable_pruning parameter, it was added to master after the branch for multi-column regression was created, which is why it's not present here. Once multi-column is merged into master, both features will be available.

I'm always welcoming of help on py-earth. There are many open issues, and if there is any that you'd be interested in addressing please feel free. You can start by commenting on the issue to say you're working on it (so nobody duplicates your effort) and then go ahead and create a pull request. If you have any questions about specifics, just ask in the thread for the issue. I don't think there is an issue for multi-core support yet. Feel free to create an issue for it, or for anything else you see that py-earth needs. I wouldn't recommend multi-core as a good starting issue to work on, though, as it's quite a lot of work.

mehdidc commented 9 years ago

I just added support of weights for outputs, a new (optional) parameter of fit (pyearth/earth.py) has been added : output_weight. It should be a vector, the ith value of this vector is the weight of the ith output. An example is provided in examples/plot_output_weight.py.

Thanks @HJarausch, I fixed the code and updated the link

jcrudy commented 9 years ago

This looks great. I fixed a few tests and things and merged it because I think it's useful as is. However, I think there are still a few things that need doing. In particular, have you updated the summary() method to display coefficients for all the outputs? Also, I think the export functions still need attention?

mehdidc commented 9 years ago

Thanks for merging, I agree, I will update summary() and export functions to deal with multi outputs. Which other things you think must be added ?