scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License

MemoryError: #131

Closed yitang closed 7 years ago

yitang commented 7 years ago

Thanks for writing this package, it helps me a lot in switching from R to Python.

I tried to train a MARS model but got a MemoryError, and I don't understand why. The data size is certainly not big, and I am able to train the same model in R.

/home/yitang/kaggle/redhat/mars.py in <module>()
     88                   verbose=True)
     89 
---> 90 scores = mars_lr.cv_score(X, y)
     91 # mars_lr.fit(X, y)
     92 # pred_prob_y = mars_lr.predict_log_proba(X)

/home/yitang/kaggle/redhat/mars.py in cv_score(self, X, y, cv_iter)
     56         else:
     57             X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)
---> 58             self.fit(X_train, y_train)
     59             pred = self.predict(X_test)
     60             pred_class = np.where(pred[:, 0] >= 0.5, 0, 1)

/home/yitang/kaggle/redhat/mars.py in fit(self, X, y)
     27 
     28     def fit(self, X, y):
---> 29         self.mars.fit(X, y)
     30         mars_X = self.mars.transform(X)
     31         self.lr.fit(mars_X, y)

/usr/local/lib/python3.4/dist-packages/py_earth-0.1.0-py3.4-linux-x86_64.egg/pyearth/earth.py in fit(self, X, y, sample_weight, output_weight, missing, xlabels, linvars)
    589         self.forward_pass(X, y,
    590                           sample_weight, output_weight, missing,
--> 591                           self.xlabels_, linvars, skip_scrub=True)
    592         if self.enable_pruning is True:
    593             self.pruning_pass(X, y,

/usr/local/lib/python3.4/dist-packages/py_earth-0.1.0-py3.4-linux-x86_64.egg/pyearth/earth.py in forward_pass(self, X, y, sample_weight, output_weight, missing, xlabels, linvars, skip_scrub)
    704         forward_passer = ForwardPasser(
    705             X, missing, y, sample_weight,
--> 706             xlabels=self.xlabels_, linvars=linvars, **args)
    707         forward_passer.run()
    708         self.forward_pass_record_ = forward_passer.trace()

/usr/local/lib/python3.4/dist-packages/py_earth-0.1.0-py3.4-linux-x86_64.egg/pyearth/_forward.cpython-34m.so in pyearth._forward.ForwardPasser.__init__ (pyearth/_forward.c:4891)()

/usr/local/lib/python3.4/dist-packages/numpy/core/numeric.py in ones(shape, dtype, order)
    188 
    189     """
--> 190     a = empty(shape, dtype, order)
    191     multiarray.copyto(a, 1, casting='unsafe')
    192     return a

MemoryError: 
In [1]: X.shape
Out[1]: (2197291, 56)

In [2]: y.shape
Out[2]: (2197291,)
jcrudy commented 7 years ago

@yitang I'm not sure what the problem is, but I'd like to find out in case it's a bug in py-earth. Could you try the following:

  1. Set max_terms=150 or some other reasonable number and see if that fixes it.
  2. If that doesn't work, could you post any more of your code? An example I can run, complete with data, would be best. However, just having the code might allow me to figure out what's going on. From your stack trace I can see that py-earth has been wrapped by some other code (mars.py), so I want to make sure mars.py is passing reasonable data into Earth.fit.

It's worth noting that py-earth is probably not as memory efficient as the R package earth, so there may be problems of a size that works in earth but not in py-earth. I wouldn't think this would be one of them, however.

jcrudy commented 7 years ago

I just had a look in the code, and indeed I think that max_terms is the problem. I'm going to open a separate issue around it, but please report back if setting max_terms fixes your problem.
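For a rough sense of why max_terms matters here: the traceback ends in a numpy.ones allocation inside ForwardPasser.__init__. The sketch below assumes (this is not confirmed in the thread) that the forward pass allocates a dense float64 array with one row per training sample and one column per term. The helper dense_array_gib is hypothetical and not part of py-earth; it only does the arithmetic.

```python
def dense_array_gib(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Size in GiB of a dense array of the given shape (float64 by default)."""
    return n_rows * n_cols * bytes_per_value / 2**30


# About 80% of the 2,197,291 rows reported above (train_size=0.8).
n_train = 1_757_832

# Assumption: one column per term, so memory grows linearly with max_terms.
for max_terms in (100, 150, 10_000, 100_000):
    size = dense_array_gib(n_train, max_terms)
    print(f"max_terms={max_terms:>7}: {size:10.1f} GiB")
```

Under that assumption, a max_terms in the low hundreds keeps the array within a few GiB, while a value that grows into the thousands would not fit in memory on most machines.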

yitang commented 7 years ago

Thanks for your quick reply. Setting max_terms to 100 solved this problem.

jcrudy commented 7 years ago

Excellent. Thanks for reporting, @yitang.