scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License
455 stars 121 forks source link

Question regarding the effect of use_fast when max_degree=1 #162

Closed odedbd closed 6 years ago

odedbd commented 6 years ago

Hi,

I have two virtualenvs which have py-earth installed. One is pip installed from master (today) and the other we pip installed from commit https://github.com/scikit-learn-contrib/py-earth/commit/30abe9623b0338a25069395ffab1d9611fc094a9 (Mar 31, 2016).

Using each virtualenv I run the same code, fitting an EARTH model to some data using 5 fold cross validation. In all cases I use the same max_terms=100 and max_degree=1 parameters. I vary the parameters use_fast=True/False and fast_K=2/20.

I find that with the current code the running time is the same regardless of whether use_fast is True or False and it is not effected by changing fast_K from 2 to 20. With the specific dataset I ma using this time is between 35-38sec for the full CV. With the older code, however, I find that although setting use_fast=False takes roughly the same time to train the full CV, setting use_fast=True and fast_K=2 reduces the training time to just under 2sec on the same dataset. Increasing to fast_K=20 trains the full CV at roughly 10sec.

To summarize: With the new code using FastMARS has no effect on running time when max_degree=1. With the older code using FastMARS has a dramatic effect on running time when max_degree=1.

From my understanding of the MARS algorithm I expect the fast algorithm to have no effect on running time for max_degree=1. The reason as I see it is that FastMARS only optimizes the selection of parent candidates for base additions. When max_degree=1 no basis function with a knot can be selected as a parent, so only the constant term needs be considered as a parent. So, the number of eligible parents is already minimized to 1. With this in mind, I would think that the new code is the correct one, and the older code has some kind of bug.

My question is which version of the code is correct? Is this a regression bug in the new code or is the older code doing something weird when max_degree=1 and use_fast=True?

Thank you for taking the time to read, I'm looking forward to your replies.

jcrudy commented 6 years ago

@odedbd Your understanding is correct. The old version had some bugs, and the new version is (more likely to be) correct.

odedbd commented 6 years ago

@jcrudy Thank you for your feedback. And for this great package.