jmcarpenter2 / parfit

A package for parallelizing the fit and flexibly scoring of sklearn machine learning models, with visualization routines.
MIT License

error: 'i' format requires -2147483648 <= number <= 2147483647 #9

Open avinash-mishra opened 5 years ago

avinash-mishra commented 5 years ago

Hi, I get this error: 'i' format requires -2147483648 <= number <= 2147483647. I am doing exactly the same as README.md, except that I am using RandomForestRegressor().

Full error:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 346, in _sendback_result
    exception=exception))
  File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/externals/loky/backend/queues.py", line 241, in put
    self._writer.send_bytes(obj)
  File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

error                                     Traceback (most recent call last)
<ipython-input-10-f10ba30832f6> in <module>
     11                                                     X_train_5, y_train_5, X_test_5, y_test_5, # nfolds=5 [optional, instead of validation set]
     12                                                     metric=roc_auc_score, greater_is_better=True,
---> 13                                                     scoreLabel='AUC')
     14 
     15 print(best_model, best_score)

~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/parfit/parfit.py in bestFit(model, paramGrid, X_train, y_train, X_val, y_val, nfolds, metric, greater_is_better, predict_proba, showPlot, scoreLabel, vrange, cmap, n_jobs, verbose)
     63     else:
     64         print("-------------FITTING MODELS-------------")
---> 65         models = fitModels(model, paramGrid, X_train, y_train, n_jobs, verbose)
     66         print("-------------SCORING MODELS-------------")
     67         scores = scoreModels(models, X_val, y_val, metric, predict_proba, n_jobs, verbose)

~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/parfit/fit.py in fitModels(model, paramGrid, X, y, n_jobs, verbose)
     49         myModels = fitModels(model, paramGrid, X_train, y_train)
     50     """
---> 51     return Parallel(n_jobs=n_jobs, verbose=verbose)(delayed(fitOne)(model, X, y, params) for params in paramGrid)

~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
    994 
    995             with self._backend.retrieval_context():
--> 996                 self.retrieve()
    997             # Make sure that we get a last message telling us we are done
    998             elapsed_time = time.time() - self._start_time

~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
    897             try:
    898                 if getattr(self._backend, 'supports_timeout', False):
--> 899                     self._output.extend(job.get(timeout=self.timeout))
    900                 else:
    901                     self._output.extend(job.get())

~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    515         AsyncResults.get from multiprocessing."""
    516         try:
--> 517             return future.result(timeout=timeout)
    518         except LokyTimeoutError:
    519             raise TimeoutError()

~/anaconda3/envs/venv_py3.6/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~/anaconda3/envs/venv_py3.6/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

error: 'i' format requires -2147483648 <= number <= 2147483647

Please help.
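The innermost frame of the remote traceback is `struct.pack("!i", n)`, where `n` is the length in bytes of the message a worker process is sending back. The `"!i"` format is a signed 32-bit integer, so the call fails for any message longer than 2**31 - 1 bytes (about 2 GiB). A minimal sketch reproducing just that limit, without joblib or parfit; the helper name `fits_in_int32_header` is made up for illustration:

```python
import struct

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit "i" field can hold


def fits_in_int32_header(n_bytes: int) -> bool:
    """Return True if a length of n_bytes can be packed with the "!i" format."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        # Same exception type as in the traceback above
        return False


print(fits_in_int32_header(INT32_MAX))      # True: exactly at the limit
print(fits_in_int32_header(INT32_MAX + 1))  # False: one byte too many
```

In other words, the error is probably not about the parameter grid itself but about the size of whatever object is being returned from the worker.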

jmcarpenter2 commented 5 years ago

Hi @avinash-mishra,

Thanks for raising this issue. Could you please share with me the ParameterGrid object you are searching over?

avinash-mishra commented 5 years ago

Hi @jmcarpenter2, thanks for the quick reply.

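from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid
from parfit import bestFit

grid = {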
    'min_samples_leaf': [1, 5, 10],
    'max_features': ['sqrt'],
    'n_estimators': [60],
    'n_jobs': [-1],
    'random_state': [42]
}
paramGrid = ParameterGrid(grid)

best_model, best_score, all_models, all_scores = bestFit(RandomForestRegressor(), paramGrid,
                                                    X_train_5, y_train_5, X_test_5, y_test_5, # nfolds=5 [optional, instead of validation set]
                                                    metric=roc_auc_score, greater_is_better=True, 
                                                    scoreLabel='AUC')

print(best_model, best_score)

The ParameterGrid is exactly the same as the one given in the README file. I searched for the error and found a Stack Overflow link.

Some people there said that the pickled model object is simply too large. My dataframes have the following shapes:

display(X_train_5.shape)
display(y_train_5.shape)
display(X_test_5.shape)
display(y_test_5.shape)

(16861, 119)
(16861, 329)
(1240, 119)
(1240, 329)
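One way to test the "pickled object is too large" theory is to measure how many bytes an object occupies once pickled and compare that with the 32-bit limit from the error message. The sketch below uses the standard `pickle` module and a small stand-in dictionary; the helper names are hypothetical, and joblib's loky backend may serialize slightly differently, so treat the number as an estimate rather than the exact message size:

```python
import pickle

INT32_MAX = 2**31 - 1  # limit from the error message, in bytes


def pickled_size(obj) -> int:
    """Number of bytes obj occupies once pickled with the highest protocol."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def exceeds_pipe_limit(obj) -> bool:
    """True if the pickled object would not fit in a signed 32-bit length header."""
    return pickled_size(obj) > INT32_MAX


# Stand-in payload; in practice this would be a fitted model
payload = {"weights": [0.0] * 1000}
print(pickled_size(payload), exceeds_pipe_limit(payload))
```

Calling `pickled_size` on a single fitted model (fitted outside of parfit, with one parameter combination) would show whether the model alone crosses the limit. Note that pickling a multi-gigabyte object needs at least that much free memory.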

I hope this helps you look into the issue and suggest a fix.

jmcarpenter2 commented 5 years ago

Hi @avinash-mishra,

This is an interesting issue. It appears to be caused by the combination of training models on massive dataframes and the fact that parfit uses multiprocessing under the hood rather than multithreading. I will look into solutions, but it may take a while to implement a fix that resolves your use case.
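The difference between the two approaches matters here because worker processes must serialize their results and send them through a pipe, while threads share the parent's memory and hand back ordinary object references. The following is only an illustration of that difference using the standard library's `ThreadPoolExecutor`; it is not parfit's implementation, and `fit_one` is a hypothetical stand-in for fitting one model:

```python
from concurrent.futures import ThreadPoolExecutor


def fit_one(params, data):
    """Hypothetical stand-in for fitting one model on shared data."""
    return {"params": params, "data": data}


data = list(range(1000))  # stand-in for a large training set
grid = [{"min_samples_leaf": m} for m in (1, 5, 10)]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda p: fit_one(p, data), grid))

# Threads share memory: each result refers to the very same object,
# so nothing was pickled or copied on the way back.
print(all(r["data"] is data for r in results))  # True
```

The trade-off is that threads only run in parallel where the underlying code releases Python's global interpreter lock, so a thread-based variant would avoid the size limit but might not give the same speed-up.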

As a side note, I am wondering why your y_train_5 and y_test_5 dataframes have 210 more columns than X_train_5 and X_test_5. Shouldn't y be a pandas Series (i.e. a one-column dataframe)?

Thanks

avinash-mishra commented 5 years ago

Hi @jmcarpenter2, it is a multi-output regression problem, a specific use case: I have to predict multiple columns, not only one.
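The 329 target columns may well be what pushes the fitted model past the 2 GiB limit. A rough back-of-the-envelope estimate, under loudly stated assumptions: trees are grown until leaves are pure (the `min_samples_leaf=1` entry of the grid), each bootstrap sample contains about 63.2% distinct rows, every tree node stores one float64 per output, and all other per-node arrays are ignored. These are assumptions for illustration, not measurements from this issue:

```python
INT32_MAX = 2**31 - 1  # limit from the error message, in bytes

# Values taken from the shapes and grid posted above
n_samples = 16861
n_outputs = 329
n_estimators = 60

# Assumptions (not measured): see the lead-in text
bytes_per_value = 8          # one float64 per output per node
unique_fraction = 0.632      # approx. share of distinct rows in a bootstrap sample

leaves_per_tree = int(n_samples * unique_fraction)
nodes_per_tree = 2 * leaves_per_tree - 1  # a binary tree with L leaves has 2L - 1 nodes

bytes_per_tree = nodes_per_tree * n_outputs * bytes_per_value
bytes_per_forest = bytes_per_tree * n_estimators

print(f"{bytes_per_forest / 1e9:.2f} GB per forest")
print(bytes_per_forest > INT32_MAX)  # True under these assumptions
```

Under these assumptions a single 60-tree forest comes out at roughly 3.4 GB, well above the limit, whereas the same forest with a single target column would be about 329 times smaller. Larger `min_samples_leaf` values or a `max_depth` cap would shrink the trees accordingly.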