Open avinash-mishra opened 5 years ago
Hi @avinash-mishra,
Thanks for raising this issue. Could you please share with me the ParameterGrid object you are searching over?
Hi @jmcarpenter2 Thanks for the quick reply.
grid = {
    'min_samples_leaf': [1, 5, 10],
    'max_features': ['sqrt'],
    'n_estimators': [60],
    'n_jobs': [-1],
    'random_state': [42]
}
paramGrid = ParameterGrid(grid)
best_model, best_score, all_models, all_scores = bestFit(
    RandomForestRegressor(), paramGrid,
    X_train_5, y_train_5, X_test_5, y_test_5,  # nfolds=5 [optional, instead of validation set]
    metric=roc_auc_score, greater_is_better=True,
    scoreLabel='AUC')
print(best_model, best_score)
The ParameterGrid is exactly the same as the one given in the README file. I searched around and found an SO link.
Some people have said that pickling the model object is way too heavy. My df looks like this.
display(X_train_5.shape)
display(y_train_5.shape)
display(X_test_5.shape)
display(y_test_5.shape)
(16861, 119)
(16861, 329)
(1240, 119)
(1240, 329)
I hope it will be helpful for you to look into the issue and suggest some fix.
Hi @avinash-mishra,
This is an interesting issue. It appears to be caused by the combination of training models on massive dataframes and the fact that parfit uses multiprocessing under the hood rather than multithreading. I will look into solutions, but it may take a while to actually implement a fix that resolves your use case.
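One way to check whether object size is the culprit is to measure how large an object becomes once pickled, since multiprocessing pickles whatever it passes between processes. The sketch below is only an illustration under that assumption: the helper names pickled_size_bytes and exceeds_int32_limit are hypothetical and are not part of parfit or scikit-learn.

```python
import pickle

# Largest value that fits in a signed 32-bit integer (2147483647).
INT32_MAX = 2**31 - 1


def pickled_size_bytes(obj):
    """Return the number of bytes obj occupies once pickled (hypothetical helper)."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def exceeds_int32_limit(obj):
    """True if the pickled payload is too large for a signed 32-bit length field."""
    return pickled_size_bytes(obj) > INT32_MAX


# Small stand-in object; in practice you would pass a fitted model or a dataframe.
small = list(range(1000))
print(pickled_size_bytes(small), exceeds_int32_limit(small))
```

Calling exceeds_int32_limit on a fitted model (rather than on the toy list used here) would show whether that model is too large to be sent back from a worker process in one message.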
As a side note, I am wondering why your y_train_5 and y_test_5 dataframes have 210 more columns than X_train_5 and X_test_5? Shouldn't y be a pandas Series (i.e. a 1-column dataframe)?
Thanks
Hi @jmcarpenter2 It is a multi-output regression problem, a specific use case: I have to predict multiple columns, not just one.
Hi, I get
error: 'i' format requires -2147483648 <= number <= 2147483647
I am doing exactly the same as README.md, except that I am using RandomForestRegressor(). Full error:
Please help.
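The wording of this error matches what Python's struct module raises when a number is packed into a signed 32-bit field that is too small for it. As far as I know, older Python versions (before 3.8) prefix each multiprocessing message with such a field holding the payload length, so a pickled object of 2 GiB or more would fail this way. The sketch below only reproduces the struct behaviour; that this is what happens inside parfit is an assumption, and pack_length is a hypothetical helper name.

```python
import struct

# Largest value that fits in a signed 32-bit integer (2147483647).
INT32_MAX = 2**31 - 1


def pack_length(n):
    """Pack n as a big-endian signed 32-bit integer (hypothetical helper)."""
    return struct.pack("!i", n)


# A length at the limit still fits in the 4-byte field.
header = pack_length(INT32_MAX)

# One byte more no longer fits, and struct raises an error.
message = None
try:
    pack_length(INT32_MAX + 1)
except struct.error as exc:
    message = str(exc)

print(message)
```

If this is indeed the cause, the usual ways out are to make the objects sent between processes smaller (fewer or shallower trees, fewer target columns per model) or to run on a Python version whose multiprocessing handles larger messages.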