mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

total_time_limit is not respected when doing feature selection #426

Closed huanvo88 closed 1 year ago

huanvo88 commented 3 years ago

Hello,

I am trying to run mljar in Compete mode for 4 hours with a custom validation set. Here is my code:

    automl = AutoML(mode='Compete', n_jobs=40, total_time_limit=4 * 3600,
                    eval_metric='accuracy',
                    validation_strategy={'validation_type': 'custom'})
    automl.fit(X, y, cv=cv)

However, it has been running for more than 4 hours. In the log file it seems to be stuck at the following step:

There is no extra output after that, and the script has been stuck there for more than 4 hours. Do you know what is going on?

pplonski commented 3 years ago

@huanvo88 it looks like the problem is with the feature selection step. Computing feature importance can be very time consuming because it is permutation based, so if your dataset has many columns (more than 1,000) it can easily get stuck there.
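
For intuition, here is a minimal scikit-learn sketch of why this scales badly with column count (an illustration of the cost only, not mljar's internal code):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    # Permutation importance re-scores the model once per feature per
    # repeat: with 800 columns and n_repeats=5 that is 4000 re-scorings.
    X, y = make_classification(n_samples=1000, n_features=800, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
    print(result.importances_mean.shape)  # (800,)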

For now the easiest solution might be to set features_selection=False in the AutoML() constructor.
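
For example, a minimal sketch of that configuration (assuming the same X, y, and cv objects as in your snippet above):

    from supervised import AutoML  # mljar-supervised

    automl = AutoML(mode='Compete', n_jobs=40, total_time_limit=4 * 3600,
                    eval_metric='accuracy',
                    validation_strategy={'validation_type': 'custom'},
                    features_selection=False)  # skip the permutation-based step
    automl.fit(X, y, cv=cv)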

How big is your dataset? How many rows and columns? Can you post the leaderboard table from your README file here? I'm curious which model was selected for feature selection and how much time is needed to train it.

huanvo88 commented 3 years ago

@pplonski Indeed my dataset has a lot of features: it has only 1000 rows but close to 800 columns (the embedding vectors from a transformer model).

My leaderboard table contains a lot of models. The best ones are the usual XGBoost, LightGBM, and CatBoost, and each one takes about 20-30 seconds to train, which is a bit slow considering that I set n_jobs = 40 (on my machine, using the Python API of LightGBM directly is very fast). I don't think it is due to CV either, because I specified a custom cv. As for the feature selection, I killed it because it took too long, so it did not return any output.

I will try mode = Compete with features_selection = False and see how it goes. I tried the Optuna mode (which skips feature selection by default) and it worked fine.

I think that in the future, when we specify total_time_limit, there should be a mechanism to halt the feature selection step when the time limit is exceeded, or to skip it entirely.

Has there been any ablation study on whether the random-feature-based selection improves model performance? And if permutation importance takes too long, could you use the default feature importance or SHAP values from XGBoost instead? I think feature selection might be useful for linear models, but maybe not so much for GBDT models.
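
For illustration, gain-based importance comes essentially for free from a trained XGBoost model (a sketch on synthetic data; not something mljar currently does):

    import numpy as np
    import xgboost as xgb

    # Gain-based importance is read off the already-built trees, so it
    # needs no extra model evaluations, unlike permutation importance.
    rng = np.random.RandomState(0)
    X = rng.rand(1000, 800)
    y = (X[:, 0] + X[:, 1] > 1).astype(int)
    model = xgb.XGBClassifier(n_estimators=50).fit(X, y)
    gain = model.get_booster().get_score(importance_type='gain')
    top10 = sorted(gain.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print(top10)  # strongest features by total split gain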

twhelan22 commented 2 years ago

I'd also like to see feature selection use a different method by default, ideally SHAP: https://towardsdatascience.com/stop-permuting-features-c1412e31b63f

pplonski commented 2 years ago

@twhelan22 in my experience SHAP was very computationally expensive; to get any result from SHAP I needed to run it on a small subsample of the data.
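
A sketch of that subsampling workaround on synthetic data (shap.TreeExplainer is fast for tree models, but the cost still scales with the number of rows scored):

    import numpy as np
    import shap
    import xgboost as xgb

    rng = np.random.RandomState(0)
    X = rng.rand(5000, 200)
    y = (X[:, 0] > 0.5).astype(int)
    model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

    # Explain only a small random subsample to keep the cost manageable.
    idx = rng.choice(len(X), size=500, replace=False)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X[idx])
    mean_abs_shap = np.abs(shap_values).mean(axis=0)  # per-feature importance
    print(mean_abs_shap.shape)  # (200,)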

EvanHong99 commented 1 year ago

I faced a similar problem at the step "* Step default_algorithms will try to check up to 1 model". My data has only 2000 rows and 190 features, but it got stuck at this step for at least 5 minutes, no matter which model I selected (random_forest/lgbm/catboost/nn). The LightGBM package on its own, however, finishes training in a very short time (within 5 seconds).
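
For comparison, a direct LightGBM run on the same data (a sketch assuming the attached X_train.csv and y_train.csv):

    import lightgbm as lgb
    import pandas as pd

    X_train = pd.read_csv('X_train.csv')
    y_train = pd.read_csv('y_train.csv').squeeze()

    # Plain LightGBM finishes within seconds on 2000 rows x 190 features.
    model = lgb.LGBMRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)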

This is my setting:

    from supervised import AutoML  # mljar-supervised

    automl = AutoML(results_path=model_root + f'period{num}',
                    total_time_limit=120,
                    model_time_limit=20,
                    mode='Perform',
                    ml_task='regression',
                    algorithms=[
                        "Baseline",
                        # "Linear",
                        # "Random Forest",
                        # "Extra Trees",
                        "LightGBM",
                        # "CatBoost",
                        # "Neural Network"
                                ],
                    train_ensemble=True,
                    eval_metric='rmse',
                    explain_level=1,
                    golden_features = False,
                    random_state=0
                    )

This is the log:

******************** start 0 ********************
AutoML directory: ./models/period0
The task is regression with evaluation metric rmse
AutoML will use algorithms: ['Baseline', 'LightGBM']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'ensemble']
* Step simple_algorithms will try to check up to 1 model
1_Baseline rmse 0.000448 trained in 2.41 seconds (1-sample predict time 0.109 seconds)
* Step default_algorithms will try to check up to 1 model

My data:

X_train.csv y_train.csv

Is there a bug that causes the slow performance, or is it just a problem with my PC? By the way, I'm not sure where the params total_time_limit and model_time_limit take effect; apparently not at the step I was stuck at.

pplonski commented 1 year ago

Hi @EvanHong99,

Here is an explanation of the feature selection procedure: https://mljar.com/automated-machine-learning/feature-selection/. It should take about as much time as training a normal model.
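
To make the linked procedure concrete, here is a rough sketch of the drop-below-random-feature idea it describes (an approximation with scikit-learn, not mljar's exact code):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance

    rng = np.random.RandomState(0)
    X = rng.rand(2000, 190)
    y = 2 * X[:, 0] + 0.1 * rng.rand(2000)

    # 1. Append a random noise column, 2. train a model, 3. compute
    # permutation importance, 4. keep only features that beat the noise.
    X_aug = np.column_stack([X, rng.rand(len(X))])
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_aug, y)
    imp = permutation_importance(model, X_aug, y, n_repeats=5, random_state=0)
    threshold = imp.importances_mean[-1]  # importance of the random feature
    kept = np.where(imp.importances_mean[:-1] > threshold)[0]
    print(f"kept {len(kept)} of {X.shape[1]} features")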

Could you please provide full training log?

EvanHong99 commented 1 year ago

@pplonski The log I pasted is the full log: it was stuck at the last line (* Step default_algorithms will try to check up to 1 model) and my CPU kept running without producing a result for about 5 minutes. That is already more than twice the total_time_limit.

In the end, I had to use FLAML instead.

But still, many thanks for your generous reply.