mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

sanity check for predict_proba in classification #219

Closed tmontana closed 3 years ago

tmontana commented 4 years ago

Hi. I have a test that checks automl.predict_proba() against manually reconstructed ensemble predictions. Until recently the two matched exactly, but not anymore. I only use an XGBoost ensemble, with no stacking, golden features, or feature selection. Any idea why there is a difference? The two methods yield results that are close but not equal (R2 = 0.98).

Am I missing something in the code below?

Thanks,

import pandas as pd
import xgboost as xgb

def get_predictions(automl, X):
    predictions = None
    models_div_count = 0

    ensemble_obj = automl._best_model
    count_selected_models = 0

    print(ensemble_obj.selected_models)

    for selected_model in ensemble_obj.selected_models:
        count_selected_models += 1

        one_model_all_folds = selected_model['model'].learners  # list of all fold models
        count_folds = 0
        pred_one_model_all_folds = None
        repeat_model_num_times = selected_model['repeat']

        print('model: ' + str(count_selected_models), 'repeat_num_times: ' + str(repeat_model_num_times))
        for fold in one_model_all_folds:
            count_folds += 1
            p = fold.model.predict(xgb.DMatrix(X))
            pred_one_model_all_folds = p if pred_one_model_all_folds is None else pred_one_model_all_folds + p

        # average over folds, then weight by the number of repeats of this model
        pred_one_model_all_folds /= float(count_folds)
        pred_one_model_all_folds *= repeat_model_num_times
        models_div_count += repeat_model_num_times

        # add this model to the full prediction values
        if pred_one_model_all_folds is not None:
            predictions = pred_one_model_all_folds if predictions is None else predictions + pred_one_model_all_folds

    # compute the weighted mean over all selected models
    print('models_div_count: ' + str(models_div_count))
    predictions /= float(models_div_count)
    df_predictions = pd.DataFrame(predictions)
    return df_predictions
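For context, a minimal sketch of the comparison step of such a sanity check, assuming a fitted automl, a feature frame X, binary classification, and scikit-learn's r2_score for the similarity measure; the positive-class column index is an assumption and may need adjusting:

import numpy as np
from sklearn.metrics import r2_score

# Manually reconstructed ensemble predictions vs. the built-in method.
manual = get_predictions(automl, X)
proba = automl.predict_proba(X)

# For binary classification, compare the positive-class probabilities.
# Column 1 as the positive class is an assumption here.
print("R2 between manual and built-in predictions:",
      r2_score(np.asarray(proba)[:, 1], manual.values.ravel()))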
pplonski commented 4 years ago

Must be a bug. Predictions should be exactly the same.

pplonski commented 3 years ago

@tmontana do you still have this bug?

tmontana commented 3 years ago

I don't. Thank you.


pplonski commented 3 years ago

@tmontana thanks for your response. I'm closing the issue.

Just to let you know, we added a new Optuna mode to AutoML. It uses the Optuna hyperparameter optimization framework for tuning, and it can sometimes find very good models.

The example:

automl = AutoML(
    algorithms=["Xgboost", "Random Forest"],
    mode="Optuna",
    optuna_time_budget=1800, # each algorithm will be tuned for 1800 seconds
    total_time_limit=8*3600
)
automl.fit(X, y)
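After fitting in Optuna mode, predictions are obtained the same way as before; a minimal sketch, where X_test is a hypothetical hold-out frame prepared like the training data:

# Class probabilities from the Optuna-tuned ensemble.
proba = automl.predict_proba(X_test)
print(proba[:5])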