mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

AutoMLException: Missing column: feature_1 in input data. Cannot predict #589

Closed — nosacapital closed this issue 1 year ago

nosacapital commented 1 year ago

I am currently trying out mljar on Numerai data. The dataset is quite large, and running a training operation kills the kernel (Python 3.9.13). At the start of every training run I also got this message:

"Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features. Numerical issues were encountered when scaling the data and might not be solved. The standard deviation of the data is probably very close to 0."
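The prescaling the warning asks for can be sketched with plain NumPy; the guard for near-zero standard deviations addresses the second half of the warning. The function name and eps threshold here are illustrative, not part of mljar's API:

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Center and scale each column; eps guards the near-zero standard
    deviations that the second half of the warning points at."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma < eps, 1.0, sigma)  # leave near-constant columns unscaled
    return (X - mu) / sigma

X = np.array([[1.0, 5.0],
              [3.0, 5.0],
              [5.0, 5.0]])  # second column is constant (std == 0)
Xs = standardize(X)
```

In practice one would use something like scikit-learn's StandardScaler for this, but the eps guard above shows why a raw `(X - mu) / sigma` can blow up on near-constant features.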

I then scaled the data and ran a PCA to reduce the number of features. I still got the message above, but training ran to its conclusion. I have been using shorter model run times to make sure I am on the right path before committing to longer, more comprehensive runs. After training completed, I tried to generate predictions, and running the code raises the AutoMLException in the title. Here is the code:

from supervised.automl import AutoML
import pandas as pd
from numerapi import NumerAPI
import time
from pathlib import Path

napi = NumerAPI()

current_round = napi.get_current_round()

start = time.time()

TARGET_NAME = "target"
PREDICTION_NAME = "prediction"

training_data = pd.read_parquet('v4/train.parquet')
validation_data = pd.read_parquet('v4/validation.parquet')
live_data = pd.read_parquet(f'v4/live_{current_round}.parquet')

# Sampling for ease of demo/repeatability - obviously would use full set for competition
# training_data = training_data.sample(3000)
# live_data = live_data.sample(3000)

feature_names = [f for f in training_data.columns if f.startswith("feature") or f == TARGET_NAME]

# Load AutoML model
automl = AutoML(results_path='AutoML_8')
live_data[PREDICTION_NAME] = automl.predict(live_data[feature_names[:-1]])

2022-12-06 22:05:19,443 supervised.exceptions ERROR Missing column: feature_1 in input data. Cannot predict
---------------------------------------------------------------------------
AutoMLException                           Traceback (most recent call last)
Cell In [10], line 3
      1 # Load AutoML model
      2 automl = AutoML(results_path='AutoML_8')
----> 3 live_data[PREDICTION_NAME] = automl.predict(live_data[feature_names[:-1]])

File /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/supervised/automl.py:387, in AutoML.predict(self, X)
    370 def predict(self, X: Union[List, numpy.ndarray, pandas.DataFrame]) -> numpy.ndarray:
    371     """
    372     Computes predictions from AutoML best model.
    373 
   (...)
    385         AutoMLException: Model has not yet been fitted.
    386     """
--> 387     return self._predict(X)

File /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/supervised/base_automl.py:1361, in BaseAutoML._predict(self, X)
   1359 def _predict(self, X):
-> 1361     predictions = self._base_predict(X)
   1362     # Return predictions
   1363     # If classification task the result is in column 'label'
   1364     # If regression task the result is in column 'prediction'
   1365     return (
   1366         predictions["label"].to_numpy()
...
   1310         )
   1312 X = X[self._data_info["columns"]]
   1313 self._validate_X_predict(X)

AutoMLException: Missing column: feature_1 in input data. Cannot predict

Please assist in this matter. Thanks

pplonski commented 1 year ago

Hi @nosacapital,

Thank you for reporting the issue. Please try increasing the total_time_limit value; the default is 3600 seconds, which might not be enough for a large dataset.
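For readers hitting the same error: the traceback shows that predict() reindexes the input to the columns recorded at fit time (the `X = X[self._data_info["columns"]]` line), so any feature that was present during training but is absent at predict time raises this exception. A stdlib-only mimic of that check (function and variable names here are hypothetical, not mljar's API):

```python
# Mimic of the column check seen in the traceback; if training is cut
# short, the recorded column list can disagree with the live frame.
def check_columns(required, provided):
    """Raise the same style of error predict() raises on a missing feature."""
    missing = [c for c in required if c not in provided]
    if missing:
        raise ValueError(f"Missing column: {missing[0]} in input data. Cannot predict")

required = ["feature_1", "feature_2", "feature_3"]  # recorded at fit time
provided = ["feature_2", "feature_3"]               # e.g. a slice dropped feature_1
try:
    check_columns(required, provided)
except ValueError as err:
    print(err)  # Missing column: feature_1 in input data. Cannot predict
```

In this thread the mismatch went away after retraining with a larger budget, passed to the constructor as, e.g., `AutoML(results_path='AutoML_8', total_time_limit=4 * 3600)`; the 4-hour figure is an illustrative value, not a recommendation from the maintainer.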

nosacapital commented 1 year ago

@pplonski

Thank you for your response, and just letting you know that what you suggested worked! And yes, thank you for a wonderful library, most appreciated.