gaoexingyun commented 4 years ago

When I use HyperoptEstimator to train the model, I want to compute log loss, so I manually restricted the search to classifiers that support predict_proba. However, fitting fails with "ValueError: Found input variables with inconsistent numbers of samples: [1818794, 79078]". What is the reason?

Here is my code:

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from hpsklearn import HyperoptEstimator, sklearn_RandomForestClassifier, extra_trees, random_forest, knn, svc
from hyperopt import hp, tpe
from random import choice

def algorithms(name):
    classifiers = [
        random_forest(name + '.random_forest')
    ]
    return choice(classifiers)

kf = StratifiedKFold(n_splits=5,shuffle=True)

# use sigmoid kernel to train model
Hyper_logloss = []
for train_index, test_index in kf.split(X,y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    estim = HyperoptEstimator(classifier=algorithms("clf"),algo=tpe.suggest,loss_fn=log_loss,continuous_loss_fn=True,trial_timeout=3600)
    estim.fit(X_train, y_train)

ValueError                                Traceback (most recent call last)
in
     19     y_train, y_test = y.iloc[train_index], y.iloc[test_index]
     20     estim = HyperoptEstimator(classifier=algorithms("clf"),algo=tpe.suggest,loss_fn=log_loss,continuous_loss_fn=True,trial_timeout=3600)
---> 21     estim.fit(X_train, y_train)

~/anaconda3/hyperopt-sklearn/hpsklearn/estimator.py in fit(self, X, y, EX_list, valid_size, n_folds, cv_shuffle, warm_start, random_state, weights)
    778                 increment = min(self.fit_increment,
    779                                 adjusted_max_evals - len(self.trials.trials))
--> 780                 fit_iter.send(increment)
    781                 if filename is not None:
    782                     with open(filename, 'wb') as dump_file:

~/anaconda3/hyperopt-sklearn/hpsklearn/estimator.py in fit_iter(self, X, y, EX_list, valid_size, n_folds, cv_shuffle, warm_start, random_state, weights, increment)
    688                               # so we notice them.
    689                               catch_eval_exceptions=False,
--> 690                               return_argmin=False, # -- in case no success so far
    691                               )
    692             else:

~/anaconda3/lib/python3.7/site-packages/hyperopt/fmin.py in fmin(fn, space, algo, max_evals, trials, rstate, allow_trials_fmin, pass_expr_memo_ctrl, catch_eval_exceptions, verbose, return_argmin, points_to_evaluate, max_queue_len, show_progressbar)
    386             catch_eval_exceptions=catch_eval_exceptions,
    387             return_argmin=return_argmin,
--> 388             show_progressbar=show_progressbar,
    389         )
    390

~/anaconda3/lib/python3.7/site-packages/hyperopt/base.py in fmin(self, fn, space, algo, max_evals, rstate, verbose, pass_expr_memo_ctrl, catch_eval_exceptions, return_argmin, show_progressbar)
    637             catch_eval_exceptions=catch_eval_exceptions,
    638             return_argmin=return_argmin,
--> 639             show_progressbar=show_progressbar)
    640
    641

~/anaconda3/lib/python3.7/site-packages/hyperopt/fmin.py in fmin(fn, space, algo, max_evals, trials, rstate, allow_trials_fmin, pass_expr_memo_ctrl, catch_eval_exceptions, verbose, return_argmin, points_to_evaluate, max_queue_len, show_progressbar)
    405                     show_progressbar=show_progressbar)
    406     rval.catch_eval_exceptions = catch_eval_exceptions
--> 407     rval.exhaust()
    408     if return_argmin:
    409         return trials.argmin

~/anaconda3/lib/python3.7/site-packages/hyperopt/fmin.py in exhaust(self)
    260     def exhaust(self):
    261         n_done = len(self.trials)
--> 262         self.run(self.max_evals - n_done, block_until_done=self.asynchronous)
    263         self.trials.refresh()
    264         return self

~/anaconda3/lib/python3.7/site-packages/hyperopt/fmin.py in run(self, N, block_until_done)
    225         else:
    226             # -- loop over trials and do the jobs directly
--> 227             self.serial_evaluate()
    228
    229         try:

~/anaconda3/lib/python3.7/site-packages/hyperopt/fmin.py in serial_evaluate(self, N)
    139                 ctrl = base.Ctrl(self.trials, current_trial=trial)
    140                 try:
--> 141                     result = self.domain.evaluate(spec, ctrl)
    142                 except Exception as e:
    143                     logger.info('job exception: %s' % str(e))

~/anaconda3/lib/python3.7/site-packages/hyperopt/base.py in evaluate(self, config, ctrl, attach_attachments)
    842                 memo=memo,
    843                 print_node_on_error=self.rec_eval_print_node_on_error)
--> 844             rval = self.fn(pyll_rval)
    845
    846             if isinstance(rval, (float, int, np.number)):

~/anaconda3/hyperopt-sklearn/hpsklearn/estimator.py in fn_with_timeout(*args, **kwargs)
    651             assert fn_rval[0] in ('raise', 'return')
    652             if fn_rval[0] == 'raise':
--> 653                 raise fn_rval[1]
    654
    655             # -- remove potentially large objects from the rval

ValueError: Found input variables with inconsistent numbers of samples: [1818794, 79078]

lucasmejiall commented 4 years ago

I had the same problem. I believe it's this line causing the problem in estimator.py line 333:

if continuous_loss_fn:
    cv_pred_pool = np.append(cv_pred_pool, learner.predict_proba(XEXval))

It should be:

if continuous_loss_fn:
    cv_pred_pool = np.append(cv_pred_pool, learner.predict_proba(XEXval)[:, 1])
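
To illustrate why the sample counts disagree, here is a small NumPy sketch. The array values are made up and only stand in for the output of learner.predict_proba; the point is that np.append without an axis argument flattens its inputs, so an (n_samples, n_classes) array turns into n_samples * n_classes values.

```python
import numpy as np

# Made-up stand-in for learner.predict_proba(XEXval): 3 samples, 2 classes.
proba = np.array([[0.4, 0.6],
                  [0.2, 0.8],
                  [0.9, 0.1]])

cv_pred_pool = np.array([])

# Without an axis argument, np.append flattens both inputs.
flattened = np.append(cv_pred_pool, proba)
print(flattened.shape)  # (6,): twice the number of samples

# Selecting the second column first keeps one value per sample.
positive_only = np.append(cv_pred_pool, proba[:, 1])
print(positive_only.shape)  # (3,)
```

Note that keeping only column 1 is meaningful for binary problems only.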

jhmenke commented 4 years ago

> I had the same problem. I believe it's this line causing the problem in estimator.py line 333:
>
> if continuous_loss_fn:
>     cv_pred_pool = np.append(cv_pred_pool, learner.predict_proba(XEXval))
>
> It should be:
>
> if continuous_loss_fn:
>     cv_pred_pool = np.append(cv_pred_pool, learner.predict_proba(XEXval)[:, 1])

What is the purpose of this? You would then select only a single class probability per sample from predict_proba, which discards the other classes.
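
For more than two classes, one possible workaround is sketched below under stated assumptions; it is not taken from the library. The sizes in the original report are consistent with a flattened multiclass output, since 1818794 = 23 × 79078. If the flattened predictions keep row-major order, a loss wrapper can reshape them to (n_samples, n_classes) before scoring. `log_loss_unflattened` is a hypothetical helper, and it assumes the labels are the integers 0..n_classes-1 in the same order as the classifier's probability columns.

```python
import numpy as np
from sklearn.metrics import log_loss

def log_loss_unflattened(y_target, y_prediction):
    """Hypothetical loss_fn: undo the flattening, then compute log loss."""
    y_target = np.asarray(y_target)
    # Assumes row-major flattening: one block of n_classes values per sample.
    proba = np.asarray(y_prediction).reshape(len(y_target), -1)
    # Assumes labels are the integers 0..n_classes-1, in column order.
    return log_loss(y_target, proba, labels=np.arange(proba.shape[1]))

# Made-up example: 4 samples, 3 classes.
y = np.array([0, 1, 2, 1])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.3, 0.5, 0.2]])
flat = proba.ravel()  # the flattened form described in this thread
print(log_loss_unflattened(y, flat))
```

Because the reshape is a no-op on an array that already has shape (n_samples, n_classes), this wrapper should also tolerate an unflattened input.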

linehammer commented 3 years ago

Sounds like the shapes of your labels and predictions are not aligned. I faced a similar problem while fitting a regression model. In my case, the number of rows in X was not equal to the number of rows in y. You are likely running into this because you remove rows containing nulls from X_train and y_train independently of each other. y_train probably has few or no nulls, while X_train probably has some. When a row is removed from X_train but the same row is not removed from y_train, the two fall out of sync and end up with different lengths. Instead, remove nulls before you separate X and y.
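
A minimal pandas sketch of that advice, with made-up data and a hypothetical label column named "target": dropping incomplete rows once, on the combined frame, keeps features and labels aligned.

```python
import numpy as np
import pandas as pd

# Made-up frame: "target" is the label column, the rest are features.
df = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [0.5, 0.1, np.nan, 0.9],
    "target": [0, 1, 0, 1],
})

# Drop incomplete rows once, for features and labels together.
clean = df.dropna()

# Only now split into X and y, so both keep exactly the same rows.
X = clean.drop(columns="target")
y = clean["target"]

print(len(X), len(y))  # 2 2
```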

In most cases, x holds your features and y is the target you want to predict. The feature array should not be 1D, so check the shape of x and, if it is 1D, convert it to 2D:

x.reshape(-1,1)

WandrilleD commented 2 weeks ago

It is 2024 and this issue is still present. I think the solution was rightly pointed out by @jhmenke above.

Here is a minimal demonstration of the problem, along with a small workaround for anyone who runs into this issue in their own code:

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from hpsklearn import HyperoptEstimator,svc

X,y = load_breast_cancer(return_X_y=True)

def roc_auc_loss(y_target, y_prediction ):
    print( y_target.shape , y_prediction.shape ) ## looking up shapes
    print( y_prediction[:10] ) ## looking up what's in the predictions
    p0 = y_prediction[::2] ## half of the elements
    p1 = y_prediction[1::2] ## other half of the elements
    print( (p0+p1)[:10] )  ## what do they sum to?

    return -roc_auc_score(y_target, y_prediction)

estim = HyperoptEstimator(classifier=svc("mySVC",probability=True),
                          loss_fn = roc_auc_loss,
                          continuous_loss_fn = True,
                          trial_timeout=10)

estim.fit(X, y)

which yields:

(114,) (228,)
[0.41895505 0.58104495 0.22532898 0.77467102 0.07204063 0.92795937
 0.03291461 0.96708539 0.0034852  0.9965148 ]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
  0%|          | 0/1 [00:00<?, ?trial/s, best loss=?]

job exception: Found input variables with inconsistent numbers of samples: [114, 228]

followed by the error traceback.

As we can see, the error stems from the fact that predict_proba() returns an array with one column per category, which is flattened before being passed to the scoring function. We can confirm this by checking that each even-indexed element and the odd-indexed element that follows it sum to 1.0.

Here is my quick fix if you don't want to wait for the library to be updated:

def roc_auc_loss_fixed(y_target, y_prediction ):
    p1 = y_prediction[1::2] ## half of the elements corresponding to proba of being category 1    
    return -roc_auc_score(y_target, p1)

estim = HyperoptEstimator(classifier=svc("mySVC",probability=True),
                          loss_fn = roc_auc_loss_fixed,
                          continuous_loss_fn = True,
                          trial_timeout=10)

estim.fit(X, y)
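
A slightly more defensive variant is sketched below as an assumption-laden alternative, not as the library's behavior. It only unpacks the predictions when their size is twice the number of targets, so it should keep working if a future release passes one probability per sample. `positive_class_proba` is a hypothetical helper and handles the binary case only.

```python
import numpy as np

def positive_class_proba(y_target, y_prediction):
    """Hypothetical helper: return one positive-class probability per sample."""
    y_prediction = np.asarray(y_prediction)
    n_samples = len(y_target)
    if y_prediction.size == 2 * n_samples:
        # Flattened (or intact) two-column output: keep the second column.
        return y_prediction.reshape(n_samples, 2)[:, 1]
    if y_prediction.size == n_samples:
        # Already one value per sample: pass through unchanged.
        return y_prediction.reshape(n_samples)
    raise ValueError("unexpected prediction size %d for %d samples"
                     % (y_prediction.size, n_samples))
```

Inside a loss function it could be used as -roc_auc_score(y_target, positive_class_proba(y_target, y_prediction)).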

hyperopt / hyperopt-sklearn

ValueError: Found input variables with inconsistent numbers of samples: [1818794, 79078] #150
