scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License

Unexplained behaviour of Stopping condition 0: Reached maximum number of terms #180

Open nikrepp opened 6 years ago

nikrepp commented 6 years ago

Hello, colleagues,

I have the following problem. I am using PyEarth for a classification task on a dataset with 300000 rows and more than 500 features, and I set max_terms to a sufficiently high number (100). But after two iterations everything stops and the message "Stopping Condition 0: Reached maximum number of terms" appears.

import numpy
from pyearth import Earth
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

model = Pipeline([('earth', Earth(max_degree=4, max_terms=100, verbose=True, enable_pruning=False)),
                  ('enet', ElasticNet(l1_ratio=0.0, alpha=1.0))])

X_t = StandardScaler().fit_transform(X_t)
model.fit(X_t, Y_t*100)

Beginning forward pass

iter parent var knot mse terms gcv rsq grsq

0 - - - 34.148441 1 34.149 0.000 0.000
1 0 180 114453 34.135289 3 34.137 0.000 0.000

Stopping Condition 0: Reached maximum number of terms

Maybe I am just doing something wrong? From the metrics I got, I can see that the model is pretty robust, but underfitted.

Nikita

jcrudy commented 6 years ago

@nikrepp I don't see any obvious problems with what you're doing. That seems like a pretty severe issue, though, so I'm surprised to be seeing it for the first time now. Here are a few questions that might help me:

  1. Is the code you included above the complete program that produces the error?
  2. Does the issue seem to depend on your data set, or does it happen with any data you use?
  3. Can you tell me what your operating system, python version, numpy, scipy, and scikit-learn versions are?
  4. How did you install py-earth, and what is pyearth.__version__?

nikrepp commented 6 years ago

Hello Jason,

Here are the answers to your questions.

  1. The complete program is below. The target is very low (0.0035).

import pandas as pd
import numpy as np

# Read target
dataset = pd.read_csv('....csv', sep=',', encoding='cp1251')
dataset = dataset.head(10000)

y = dataset[u'Флаг рефинансирования']
X = dataset.drop(dataset.columns[[0,1,2,3,6]], axis=1)

import pyearth
import scipy
import sklearn
import numpy
print(pyearth.__version__)
print(numpy.__version__)
print(scipy.__version__)
print(sklearn.__version__)

import numpy
from pyearth.earth import Earth
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

model = Pipeline([('earth', Earth(max_degree=4, max_terms=10, minspan_alpha=10, verbose=True, enable_pruning=False)),
                  ('enet', ElasticNet(l1_ratio=0.0, alpha=1.0))])

X = StandardScaler().fit_transform(X)
model.fit(X, y)

Beginning forward pass

iter parent var knot mse terms gcv rsq grsq

0 - - - 0.002394 1 0.002 0.000 0.000
1 0 304 5228 0.002354 3 0.002 0.017 0.016
2 1 344 7108 0.002295 5 0.002 0.042 0.040
3 2 160 3478 0.002273 7 0.002 0.051 0.048
4 5 573 1411 0.002195 9 0.002 0.083 0.080
5 6 450 4536 0.002195 11 0.002 0.083 0.079

Stopping Condition 0: Reached maximum number of terms

C:\Users\I304909\AppData\Local\Continuum\Miniconda2\envs\tensorflow\lib\site-packages\sklearn\linear_model\coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems. ConvergenceWarning)

Out[4]:

Pipeline(memory=None, steps=[('earth', Earth(allow_linear=None, allow_missing=False, check_every=None, enable_pruning=False, endspan=None, endspan_alpha=None, fast_K=None, fast_h=None, feature_importance_type=None, max_degree=4, max_terms=10, min_search_points=None, minspan=None, minspan_alpha=10, penalty=None, ...alse, precompute=False, random_state=None, selection='cyclic', tol=0.0001, warm_start=False))])

  2. I've tested it on the Adult dataset from UCI, and it works!

import pandas as pd
import numpy as np

# Read target
dataset = pd.read_csv('C:/.../Census01.csv', sep=';', encoding='utf8')
dataset = dataset

for i in dataset.columns:
    dataset[i] = dataset[i].factorize()[0].astype(np.int32)

y = dataset['age']
X = dataset.drop(dataset.columns[[0]], axis=1)
model2 = Pipeline([('earth', Earth(max_degree=4, max_terms=10, verbose=True, enable_pruning=False)),
                   ('enet', ElasticNet(l1_ratio=0.0, alpha=1.0))])

X = StandardScaler().fit_transform(dataset)
model2.fit(X, y)

Beginning forward pass

iter parent var knot mse terms gcv rsq grsq

0 - - - 241.716883 1 241.727 0.000 0.000
1 0 4 -1 238.942474 2 238.977 0.011 0.011
2 1 4 -1 235.893861 3 235.952 0.024 0.024
3 0 1 -1 234.005053 4 234.087 0.032 0.032
4 1 6 -1 232.915885 5 233.021 0.036 0.036
5 0 11 -1 231.898621 6 232.027 0.041 0.040
6 0 9 19353 231.112850 8 231.288 0.044 0.043
7 0 0 -1 230.395323 9 230.594 0.047 0.046
8 8 5 -1 229.583339 10 229.804 0.050 0.049
9 0 2 -1 229.275825 11 229.520 0.051 0.050

Stopping Condition 0: Reached maximum number of terms

  3. Windows 10. Python: I've tested 2.7 and 3 (the same behavior). PyEarth 0.1.0, NumPy 1.13.3, SciPy 1.0.0, scikit-learn 0.19.1.

  4. I tried different ways: first building from source, most recently through Conda (the same behavior).

Thanks! I am also very interested in what is causing this.

jcrudy commented 6 years ago

@nikrepp Thanks for all the info. In the code you pasted above, you set max_terms to 10, and the forward pass terminated after 5 iterations. That is expected behavior as each iteration produces 2 terms (assuming it finds a knot that is superior to the linear term). Is that the problem you are observing, or is there other worse behavior you're seeing? The reason it goes to iteration 9 on the UCI data set is that it is picking linear basis functions (knot = -1), which only add one term each.
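The term arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not py-earth code: the function name `forward_pass_term_counts` is hypothetical, and the stopping rule (stop once the term count exceeds max_terms) is an assumption inferred from the two logs in this thread rather than from the py-earth source.

```python
def forward_pass_term_counts(knots, max_terms):
    """Return the running term count after each forward-pass iteration.

    knots holds one entry per candidate iteration. A value of -1 marks a
    linear basis function (adds one term); any other value marks a knot,
    i.e. a pair of hinge functions (adds two terms).
    """
    counts = [1]  # iteration 0 contains only the intercept term
    for knot in knots:
        # Assumed stopping rule, inferred from the logs in this thread:
        # stop once the number of terms exceeds max_terms.
        if counts[-1] > max_terms:
            break
        counts.append(counts[-1] + (1 if knot == -1 else 2))
    return counts


# Knots from the first log (all hinge pairs), max_terms=10
print(forward_pass_term_counts([5228, 7108, 3478, 1411, 4536], 10))

# Knots from the UCI log (mostly linear terms), max_terms=10
print(forward_pass_term_counts([-1, -1, -1, -1, -1, 19353, -1, -1, -1], 10))
```

Under these assumptions the first call reproduces the "terms" column 1, 3, 5, 7, 9, 11 and the second reproduces 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, which is why the UCI run gets to iteration 9 with the same max_terms.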

nikrepp commented 6 years ago

Hello Jason,

Fortunately, I cannot reproduce the weird behaviour anymore, so I assume it was a corrupted install from source under Python 2 on Windows.

Thank you for all the details. I am looking forward to further development of this framework: objectives for classification problems, better support for categorical predictors, and interpretation of fitted relationships.

Thanks!

P.S. I would be glad to have the opportunity to contribute to one of these topics.

Regards, Nikita