szilard opened this issue 4 years ago
Code to compare HistGBT to lightgbm:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])
X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)
X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
Looks like HistGBT does not support sparse matrices:
Traceback (most recent call last):
  File "run.py", line 43, in <module>
    md.fit(X_train, y_train)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 121, in fit
    X, y = self._validate_data(X, y, dtype=[X_DTYPE],
  File "/usr/local/lib/python3.8/dist-packages/sklearn/base.py", line 432, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 795, in check_X_y
    X = check_array(X, accept_sparse=accept_sparse,
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 575, in check_array
    array = _ensure_sparse_format(array, accept_sparse=accept_sparse,
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 353, in _ensure_sparse_format
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Experiments on m5.2xlarge (8 cores, 32GB RAM):
lightgbm:
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
1.8572165966033936
0.7300012836184978
Will need dense matrices for lightgbm as well for fair comparison to HistGBT:
X_train_DENSE = X_train.toarray()
X_test_DENSE = X_test.toarray()
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
2.1161305904388428
0.7300012836184978
HistGBT with dense matrices:
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
The process starts filling up the RAM (slowly) and eventually runs out of memory (the OS kills the process):
python3 run.py
Killed
HistGBT does not crash (OOM) with a smaller max_leaf_nodes:
md = HistGradientBoostingClassifier(max_leaf_nodes=128, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
34.628803730010986
0.731705293267934
which is considerably slower than lightgbm:
md = lgb.LGBMClassifier(num_leaves=128, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
0.948988676071167
0.7334361781282212
Memory usage:
md = lgb.LGBMClassifier(num_leaves=128, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
RAM usage on server:
md = HistGradientBoostingClassifier(max_leaf_nodes=128, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
With fewer trees, HistGBT:
md = HistGradientBoostingClassifier(max_leaf_nodes=128, learning_rate=0.1, max_iter=10)
6.3 GB
With shallower trees, HistGBT:
md = HistGradientBoostingClassifier(max_leaf_nodes=16, learning_rate=0.1, max_iter=100)
3.3 GB
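The GB figures above came from watching the process on the server. For a reproducible number instead of eyeballing, here is a minimal sketch of reading the process's peak RSS from inside the script, using only the standard-library resource module (note that on Linux ru_maxrss is reported in kilobytes):

import resource
import time

start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
# peak resident set size of this process so far, converted from kB to GB (Linux)
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)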
Summary so far:
HistGBT (sklearn's HistGradientBoostingClassifier) currently does not support sparse matrices for encoding categorical variables.
On the airline dataset (with dense matrices) it uses a lot of memory (lightgbm does not use considerable memory other than the data, even with dense matrices). Is this a memory leak? The amount of memory increases with the number of trees and the depth of the trees. It runs out of memory even for small data / not-too-deep trees.
It also runs slowly compared to lightgbm (both on dense matrices).
Am I doing something wrong? @amueller @ogrisel @laurae2 Is this because of categorical data (whereas previous benchmarks were on numeric data)? Can I change something in the code to make it better?
HistGBT (sklearn's HistGradientBoostingClassifier) currently does not support sparse matrices for encoding categorical variables.
This is a known limitation. However for categorical variables there are three solutions that do not involve sparse preprocessing of the training data:
On the airline dataset (with dense matrices) it uses a lot of memory (lightgbm does not use considerable memory other than the data, even with dense matrices). Is this a memory leak? The amount of memory increases with the number of trees and the depth of the trees. It runs out of memory even for small data / not-too-deep trees.
Do you use scikit-learn master? We recently fixed some cyclic references that prevented the GC from properly releasing memory in a timely fashion: scikit-learn/scikit-learn#18334
If you want to try the master branch without building from source, feel free to use the nightly builds:
https://scikit-learn.org/0.21/developers/advanced_installation.html#installing-nightly-builds
It also runs slow compared to lightgbm (both on dense matrices).
lightgbm 3.0 is known to be ~2x faster than scikit-learn, probably because of the new row-wise parallelism in histograms:
https://github.com/microsoft/LightGBM/issues/2791#issuecomment-688258910
Also, on hyper-threaded machines, you want to limit the number of threads explicitly with OMP_NUM_THREADS=number_of_physical_cores python benchmark.py
to avoid over-subscription issues. We want to do that automatically but this needs a bit of work.
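As a complement to the OMP_NUM_THREADS environment variable, a minimal sketch of capping the thread pools from inside the script with the threadpoolctl package (assumption: threadpoolctl is installed; it is the same utility scikit-learn uses internally to manage OpenMP/BLAS pools):

from threadpoolctl import threadpool_limits

# cap all detected OpenMP/BLAS thread pools at the number of physical cores (here assumed to be 4)
with threadpool_limits(limits=4):
    md.fit(X_train, y_train)

Setting OMP_NUM_THREADS before launching the process, as suggested above, remains the simpler option.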
Thanks @ogrisel for very quick answer.
Encoding: yeah, I know I could use ordinal or target encoding. In fact there was some discussion on this in 2015 on exactly this dataset; it seemed that ordinal encoding could actually get better AUC (in random forest) than 1-hot. The reason I did 1-hot in the benchmark, I guess, is that it has been (used to be?) the preferred method for practitioners and it was also the common denominator for all packages (e.g. I kept using 1-hot even with lightgbm in the benchmark, even after lightgbm started using "direct" encoding).
So maybe I should try ordinal encoding (and others), but then I should do the same with all the tools. Or at least try it out. But then of course there are so many other things I should also do (e.g. using more datasets of different structure, sparsity, etc.) to make the benchmark more meaningful. All I managed to do is create a list a while ago.
Thanks @ogrisel for suggestions, I just used the latest release, I'll try out the nightly build and also OMP_NUM_THREADS. Will add results here.
Keeping track of things:
sudo apt install python3-pip
sudo pip3 install -U pandas lightgbm sklearn
Data size in RAM:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
import sys
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
sys.getsizeof(d_all)/1e6
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])
sys.getsizeof(d_all)/1e6
X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()
X_all_cat.data.nbytes/1e6
X_all.data.nbytes/1e6
X_all_DENSE = X_all.toarray()
X_all_DENSE.nbytes/1e6
>>> sys.getsizeof(d_all)/1e6
88.390248
>>> sys.getsizeof(d_all)/1e6
26.000016
>>> X_all_cat.data.nbytes/1e6
9.6
>>> X_all.data.nbytes/1e6
12.8
>>> X_all_DENSE.nbytes/1e6
1102.4
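As a sanity check on the ~1.1 GB dense figure, the size follows directly from the shape of the expanded matrix (assuming float64, i.e. 8 bytes per value):

# rough check: dense size = rows * one-hot-expanded columns * 8 bytes (float64)
n_rows, n_cols = X_all_DENSE.shape
print(n_rows * n_cols * 8 / 1e6)  # should match X_all_DENSE.nbytes/1e6, ~1102 MB
# the sparse CSR only stores the ~12.8 MB of non-zero values (plus index arrays)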
I was too curious, here are the results with ordinal encoding:
import pandas as pd
import sklearn
from sklearn import preprocessing
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
vars_cat = ["Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime", "Distance"]
input_all = vars_cat + vars_num
ordinal_encoder = preprocessing.OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
preprocessor = ColumnTransformer([
        ("cat", ordinal_encoder, vars_cat),
    ],
    remainder="passthrough",
)
X_train = preprocessor.fit_transform(d_train[vars_cat + vars_num])
y_train = (d_train["dep_delayed_15min"] == "Y").values
X_test = preprocessor.transform(d_test[vars_cat + vars_num])
y_test = (d_test["dep_delayed_15min"] == "Y").values
print(f"n_samples={X_train.shape[0]}")
print(f"n_features={X_train.shape[1]}")
print(f"LightGBM {lgb.__version__}:")
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(f" - training time: {time.time() - start_time:.3f}s")
y_pred = md.predict_proba(X_test)[:, 1]
print(f" - ROC AUC: {metrics.roc_auc_score(y_test, y_pred):.3f}")
print(f"scikit-learn {sklearn.__version__}:")
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1,
                                    max_iter=100)
start_time = time.time()
md.fit(X_train, y_train)
print(f" - training time: {time.time() - start_time:.3f}s")
y_pred = md.predict_proba(X_test)[:, 1]
print(f" - ROC AUC: {metrics.roc_auc_score(y_test, y_pred):.3f}")
I set: OMP_NUM_THREADS=4
(this dataset has too few features to really benefit from many threads and both lightgbm and scikit-learn suffer from over-subscription):
n_samples=100000
n_features=8
LightGBM 3.0.0:
- training time: 1.508s
- ROC AUC: 0.718
scikit-learn 0.24.dev0:
- training time: 3.487s
- ROC AUC: 0.718
So indeed LightGBM 3.0 is a bit more than 2x faster than scikit-learn master, but there is some variability on such short runs.
Nice @ogrisel, I was working on that as well, you got there first 👍
It's weird, then, that with 1-hot encoding HistGBT is 30x slower (see above). It's just another data matrix (though wider) and it has mostly 0s and 1s (though I use dense matrices, so lightgbm does not have knowledge of that either).
It's expected for high cardinality features: it has a lot more work to do to build the histograms for the expanded features and for each such feature we treat the majority of 0s as any other value. LightGBM is very smart at handling sparse features but this is not (yet) implemented in scikit-learn.
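A rough back-of-the-envelope for the extra work (assumption: ~690 one-hot columns versus 8 ordinal-encoded ones, consistent with the dense-size numbers earlier in the thread):

# per-node histogram building scales roughly with n_samples * n_features,
# so the one-hot layout implies on the order of
print(690 / 8)  # ~86x more histogram work per node, before any sparsity tricks or bundling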
Here's my version (with LabelEncoder, pardon me, I'm mainly an R guy LOL):
import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])
X_all = d_all[vars_num+vars_cat].values
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)
X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
Timings:
LGBMClassifier(num_leaves=512)
>>> print(time.time() - start_time)
1.4476277828216553
>>> print(metrics.roc_auc_score(y_test, y_pred))
0.7177775781298882
HistGradientBoostingClassifier(max_leaf_nodes=512)
>>> print(time.time() - start_time)
3.791149616241455
>>> print(metrics.roc_auc_score(y_test, y_pred))
0.7164761138428702
It's expected for high cardinality features: it has a lot more work to do to build the histograms for the expanded features and for each such feature we treat the majority of 0s as any other value. LightGBM is very smart at handling sparse features but this is not (yet) implemented in scikit-learn.
I used dense matrices with lightgbm as well, are you suggesting lightgbm might do a "preprocessing" step to lump the 0s into a bin and that does not need to be repeated over and over the iterations or something like that?
I used dense matrices with lightgbm as well, are you suggesting lightgbm might do a "preprocessing" step to lump the 0s into a bin and that does not need to be repeated over and over the iterations or something like that?
Indeed, I did not get that. LightGBM might be clever enough to automatically detect sparse patterns even in dense arrays. Furthermore, if it detects sparsity patterns it might also benefit from Exclusive Feature Bundling https://lightgbm.readthedocs.io/en/latest/Parameters.html#enable_bundle which we do not implement in scikit-learn at the moment.
LightGBM does feature bundling for features that are mutually exclusive, as is the case for OHE'd features.
Lol @ogrisel is a few seconds faster as usual
@szilard we have an internal sklearn.ensemble._hist_gradient_boosting.utils.get_equivalent_model in sklearn to make fairer comparisons between models (we deactivate feature bundling here), in case you're curious about various potential discrepancies.
This is mostly to make sure we get equivalent predictions; obviously using it for benchmark purposes would not be fair to LightGBM and others because we deactivate some advanced fancy stuff that isn't yet implemented in sklearn.
I am not sure that deactivating feature bundling can be considered "fair" for end-user-facing benchmarks. But it's useful for us to debug scikit-learn's performance and treat one problem at a time.
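One way to probe how much of the gap on dense 1-hot data is due to Exclusive Feature Bundling specifically is to rerun the LightGBM fit with bundling disabled. A sketch, under the assumption that extra keyword arguments on LGBMClassifier are forwarded to LightGBM as raw parameters (they are passed through as **kwargs):

# same dense one-hot setup as above, but with Exclusive Feature Bundling turned off
md = lgb.LGBMClassifier(num_leaves=128, learning_rate=0.1, n_estimators=100,
                        enable_bundle=False)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))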
Thanks @ogrisel and @NicolasHug for insights. I will look at a few things, will post all findings here.
Using nightly builds:
sudo apt install python3-pip
sudo pip3 install -U pandas lightgbm sklearn
sudo pip3 install -U --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn
Found existing installation: scikit-learn 0.23.2
Uninstalling scikit-learn-0.23.2:
Successfully uninstalled scikit-learn-0.23.2
Successfully installed scikit-learn-0.24.dev0
Running the original 1-hot encoding, dense matrices (both lightgbm and HistGBT):
import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])
X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)
X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]
X_train_DENSE = X_train.toarray()
X_test_DENSE = X_test.toarray()
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
Great, @ogrisel: the memory issue is fixed (looks like it was the memory leak you mentioned). Now the code above does not crash on 32 GB; memory usage is:
So HistGBT is still using some memory (vs lightgbm using very little), but now it's much better than before. Thanks @ogrisel and @amueller for suggesting to use the dev version (master/nightly builds).
Notwithstanding the other encoding options, if we look at 1-hot encoding:
Run time [sec] and AUC:
data size 100K:
lightgbm sparse
1.8356289863586426
0.7300012836184978
lightgbm dense
2.1029934883117676
0.7300012836184978
HistGBT dense
35.00905084609985
0.728212555455098
data size 1M:
lightgbm sparse
4.715407609939575
0.764772836574283
lightgbm dense
6.551331520080566
0.764772836574283
HistGBT dense
170.7247931957245
0.7655385730787526
data size 10M:
Cannot create the dense matrix with 32 GB RAM. Will need a bigger cloud instance. (TODO)
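A rough back-of-the-envelope on why 32 GB cannot hold the 10M dense matrix (assuming the one-hot expansion stays around ~700 columns, as in the smaller samples, and float64 storage):

# ~10.1M rows (train + test) * ~700 one-hot columns * 8 bytes per float64
print(10.1e6 * 700 * 8 / 1e9)  # roughly 57 GB for the dense matrix alone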
Code:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-10m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])
X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)
X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]
X_train_DENSE = X_train.toarray()
X_test_DENSE = X_test.toarray()
print("lightgbm sparse")
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
print("lightgbm dense")
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
print("HistGBT dense")
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
The Exclusive Feature Bundling feature of LightGBM is really efficient at dealing with OHE categorical variables. However, I still think that OHE is useless for decision-tree-based algorithms and that ordinal encoding (or native categorical variable support) is a better option.
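For reference, a hedged sketch of the native categorical route mentioned above, reusing the ordinal-encoded X_train from the ColumnTransformer example earlier (where the first len(vars_cat) columns are the encoded categoricals). It assumes scikit-learn >= 0.24 for the categorical_features parameter; note that the -1 codes produced for unseen test categories may need special handling depending on the version:

# scikit-learn: mark the ordinal-encoded columns as categorical (assumes sklearn >= 0.24)
cat_mask = [True] * len(vars_cat) + [False] * len(vars_num)
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100,
                                    categorical_features=cat_mask)
md.fit(X_train, y_train)

# LightGBM: point categorical_feature at the same columns (it also accepts pandas 'category' dtypes)
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
md.fit(X_train, y_train, categorical_feature=list(range(len(vars_cat))))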
New tool https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html
based on POC https://github.com/ogrisel/pygbm mentioned earlier here https://github.com/szilard/GBM-perf/issues/15