vanderschaarlab / autoprognosis

A system for automating the design of predictive modeling pipelines tailored for clinical prognosis.
https://www.autoprognosis.vanderschaar-lab.com/
Apache License 2.0
114 stars 26 forks

Predictions for new data #39

Closed. MassimilianoGrassiDataScience closed this issue 1 year ago

MassimilianoGrassiDataScience commented 1 year ago

I have been trying AutoPrognosis and was able to develop a model successfully. Now I want to apply it to new data. I loaded the model following the tutorial and then (naively?) called .predict_proba, but it did not work.

I tried several times, including re-developing the model with different data, and every attempt failed, each with a different error.

E.g., the latest error is "This QuantileTransformer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator." In this case, the model is: {'models': [<autoprognosis.plugins.pipeline.nop_normal_transform_catboost at 0x16f4b31f0>], 'weights': [0.9999999900000002], 'explainer_plugins': [], 'explanations_nepoch': 10000, 'explainers': None}

What should I do between load_model_from_file(model_path) and .predict_proba?

Thanks!

MassimilianoGrassiDataScience commented 1 year ago

As another example, I get the error "'SimpleClassifierAggregator' object has no attribute '_classes'" with the following model: {'models': [<autoprognosis.plugins.pipeline.fast_ica_uniform_transform_lda at 0x173c63400>, <autoprognosis.plugins.pipeline.nop_scaler_random_forest at 0x173c316d0>, <autoprognosis.plugins.pipeline.variance_threshold_scaler_catboost at 0x173b1edf0>], 'method': 'average', 'explainer_plugins': [], 'explanations_nepoch': 10000, 'clf': <autoprognosis.plugins.ensemble.combos.SimpleClassifierAggregator at 0x173b1eee0>}

bcebere commented 1 year ago

Hello @MassimilianoGrassiDataScience

AutoPrognosis selects a model architecture and saves it without training the model. What you load from the file is the architecture only, so you can run your own benchmarks on different folds.

The main README contains such an example:

```python
from pathlib import Path

from sklearn.datasets import load_breast_cancer

from autoprognosis.studies.classifiers import ClassifierStudy
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_estimator

X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

df = X.copy()
df["target"] = Y

workspace = Path("workspace")
study_name = "example"

study = ClassifierStudy(
    study_name=study_name,
    dataset=df,  # pandas DataFrame
    target="target",  # the label column in the dataset
    num_iter=100,  # how many trials to do for each candidate
    timeout=60,  # seconds
    classifiers=["logistic_regression", "lda", "qda"],
    workspace=workspace,
)

study.run()

output = workspace / study_name / "model.p"
model = load_model_from_file(output)

# <model> contains the optimal architecture, but the model is not trained yet. You need to call fit() to use it.
# This way, we can further benchmark the selected model on the training set.
metrics = evaluate_estimator(model, X, Y)

print(f"model {model.name()} -> {metrics['clf']}")

# Train the model
model.fit(X, Y)

# Predict the probabilities of each class using the model
model.predict_proba(X)
```

As you can see, you need to call fit on your data before calling predict_proba, even if it is the same data you used for the model search.
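For the original question (predictions for new data), the same order applies: fit on the labelled data first, then call predict_proba on the unseen rows. The sketch below is only an illustration of that order. The helper `fit_then_predict` and the stand-in class `BaseRateModel` are hypothetical names, not part of AutoPrognosis; the only assumption about the loaded model is that it exposes fit and predict_proba, as in the README example.

```python
def fit_then_predict(model, X_train, y_train, X_new):
    """Fit an untrained model on labelled data, then score unseen rows.

    Hypothetical helper: it only assumes fit() and predict_proba() exist.
    """
    model.fit(X_train, y_train)
    return model.predict_proba(X_new)


class BaseRateModel:
    """Toy stand-in for a loaded architecture (NOT an AutoPrognosis class).

    Like the loaded model, it cannot predict until fit() has been called.
    """

    def __init__(self):
        self._positive_rate = None  # set by fit()

    def fit(self, X, y):
        # "Training" here is just the share of positive labels.
        self._positive_rate = sum(y) / len(y)
        return self

    def predict_proba(self, X):
        if self._positive_rate is None:
            raise RuntimeError("model is not fitted yet; call fit() first")
        p = self._positive_rate
        # One [P(class 0), P(class 1)] pair per input row.
        return [[1.0 - p, p] for _ in X]


# Labelled data used for the model search, plus new unlabelled rows.
X_train = [[0.1], [0.4], [0.8], [0.9]]
y_train = [0, 0, 1, 1]
X_new = [[0.2], [0.7]]

probs = fit_then_predict(BaseRateModel(), X_train, y_train, X_new)
```

With a real study, the object returned by load_model_from_file would take the place of `BaseRateModel()`, and `X_new` would need the same feature columns as the training data.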

We will improve the error message in the future.

Let me know if this fixes your problem.

MassimilianoGrassiDataScience commented 1 year ago

Thank you! It works now!