LightGBM feature selection and low model size

ninenerd commented 4 years ago

So I build a Random Forest regression model earlier, model size was around 200mb with HPT and pruning.

Now that I'm trying with lightGBM, a regression model, without HPT I am getting same high accuracy (rmse). I used default parameters only . Is lightGBM so effective compared to RF ?
Size of saved lightGBM model is 250kb, does that sound right for typical lightGBM model, or I am making some mistake ?
Is there a way for feature selection in lightGBM (top features wrt to target) ?

Below is the sample code

X = dataset.data; y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

model = ltb.LGBMRegressor()
model.fit(X_train, y_train)
print(); print(model)

expected_y  = y_test
predicted_y = model.predict(X_test)

print(metrics.r2_score(expected_y, predicted_y))
print(metrics.mean_squared_log_error(expected_y, predicted_y))

filename = 'lightgbm_model.sav'
pickle.dump(model, open(filename, 'wb'))

New to this model, hence questions.

jameslamb commented 3 years ago

Hi @ninenerd , thanks for using LightGBM! Apologies for our delayed response.

Now that I'm trying with lightGBM, a regression model, without HPT I am getting same high accuracy (rmse). I used default parameters only . Is lightGBM so effective compared to RF ?

Performance is dependent on the problem and data, so we can't make a broad claim about LightGBM always being more effective than any other technique. LightGBM has been used in many winning solutions for machine learning competitions (https://github.com/microsoft/LightGBM/tree/c20cce0474ceba0a3239e09782f50aed1050bc38/examples#machine-learning-challenge-winning-solutions).

If you want to go very deep into how LightGBM achieves good performance in some tasks, you can see the original LightGBM paper.

Size of saved lightGBM model is 250kb, does that sound right for typical lightGBM model, or I am making some mistake ?

Are you doing work where the size of the model on disk is very very important? Like deploying to a very small machine? If not, I would not worry too much about this. 250kb does not sound very large to me. If that size is a problem for you, please give use more details in a separate issue.

Is there a way for feature selection in LightGBM (top features wrt to target) ?

Like other tree-based supervised learning models, LightGBM has feature selection built into it. Features that don't help explain the target will never be chosen for splits.

If you want to examine feature importance after a trained model, see model.feature_importances. You can also explore plot_importances() (https://github.com/microsoft/LightGBM/blob/c20cce0474ceba0a3239e09782f50aed1050bc38/python-package/lightgbm/plotting.py#L29)

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(
    return_X_y=True,
    as_frame=True
)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    random_state=42
)

gbm = lgb.LGBMClassifier(
    n_estimators=50,
    colnames=list(X.columns)
)

gbm.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    early_stopping_rounds=5,
    verbose=False,
)

print(gbm.feature_importances_)

from lightgbm.plotting import plot_importance
plot_importance(gbm, color='r', title='t', xlabel='x', ylabel='y')

jameslamb commented 3 years ago

I'm going to close this issue, because we try to use the Issues list to track work that needs to be done, like fixing bugs and adding new features. If you want to get more familiar with LightGBM, please consult the documentation. If you find that something is missing from the documentation that could help you, please open a new issue and request it or open a pull request that proposes documentation changes.

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

microsoft / LightGBM

LightGBM feature selection and low model size #3511