szilard opened this issue 4 years ago
Code to compare HistGBT to lightgbm:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])
X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)
X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
Looks like HistGBT does not support sparse matrices:
Traceback (most recent call last):
  File "run.py", line 43, in <module>
    md.fit(X_train, y_train)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 121, in fit
    X, y = self._validate_data(X, y, dtype=[X_DTYPE],
  File "/usr/local/lib/python3.8/dist-packages/sklearn/base.py", line 432, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 795, in check_X_y
    X = check_array(X, accept_sparse=accept_sparse,
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 575, in check_array
    array = _ensure_sparse_format(array, accept_sparse=accept_sparse,
  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 353, in _ensure_sparse_format
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Experiments on m5.2xlarge (8 cores, 32GB RAM):
lightgbm:
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
1.8572165966033936
0.7300012836184978
Will need dense matrices for lightgbm as well for fair comparison to HistGBT:
X_train_DENSE = X_train.toarray()
X_test_DENSE = X_test.toarray()
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
2.1161305904388428
0.7300012836184978
HistGBT with dense matrices:
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
The process starts filling up the RAM (slowly) and eventually runs out of memory (the OS kills the process):
python3 run.py
Killed
HistGBT does not crash (OOM) with a smaller max_leaf_nodes:
md = HistGradientBoostingClassifier(max_leaf_nodes=128, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
34.628803730010986
0.731705293267934
which is considerably slower than lightgbm:
md = lgb.LGBMClassifier(num_leaves=128, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
0.948988676071167
0.7334361781282212
Memory usage:
md = lgb.LGBMClassifier(num_leaves=128, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
RAM usage on server:
md = HistGradientBoostingClassifier(max_leaf_nodes=128, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
With fewer trees, HistGBT:
md = HistGradientBoostingClassifier(max_leaf_nodes=128, learning_rate=0.1, max_iter=10)
6.3 GB
With shallower trees, HistGBT:
md = HistGradientBoostingClassifier(max_leaf_nodes=16, learning_rate=0.1, max_iter=100)
3.3 GB
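The GB figures above came from watching the process on the server. For a reproducible number instead of eyeballing, here is a minimal sketch of reading the process's peak RSS from inside the script, using only the standard-library resource module (note that on Linux ru_maxrss is reported in kilobytes):

import resource
import time

start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
# peak resident set size of this process so far, converted from kB to GB (Linux)
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6)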
Summary so far:
HistGBT (sklearn's HistGradientBoostingClassifier) currently does not support sparse matrices for encoding categorical variables.
On the airline dataset (with dense matrices) it uses a lot of memory (lightgbm does not use considerable memory other than the data, even with dense matrices). Is this a memory leak? The amount of memory increases with the number of trees and the depth of the trees. It runs out of memory even for small data / not-too-deep trees.
It also runs slowly compared to lightgbm (both on dense matrices).
Am I doing something wrong? @amueller @ogrisel @laurae2 Is this because of categorical data (whereas previous benchmarks were on numeric data)? Can I change something in the code to make it better?
HistGBT (sklearn's HistGradientBoostingClassifier) currently does not support sparse matrices for encoding categorical variables.
This is a known limitation. However for categorical variables there are three solutions that do not involve sparse preprocessing of the training data:
On the airline dataset (with dense matrices) it uses a lot of memory (lightgbm does not use considerable memory other than the data, even with dense matrices). Is this a memory leak? The amount of memory increases with the number of trees and the depth of the trees. It runs out of memory even for small data / not-too-deep trees.
Do you use scikit-learn master? We recently fixed some cyclic references that prevented the GC from properly releasing memory in a timely fashion: scikit-learn/scikit-learn#18334
If you want to try the master branch without building from source, feel free to use the nightly builds:
https://scikit-learn.org/0.21/developers/advanced_installation.html#installing-nightly-builds
It also runs slow compared to lightgbm (both on dense matrices).
lightgbm 3.0 is known to be ~2x faster than scikit-learn, probably because of the new row-wise parallelism in histograms:
https://github.com/microsoft/LightGBM/issues/2791#issuecomment-688258910
Also, on hyper-threaded machines, you want to limit the number of threads explicitly with OMP_NUM_THREADS=number_of_physical_cores python benchmark.py
to avoid over-subscription issues. We want to do that automatically but this needs a bit of work.
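As a complement to the OMP_NUM_THREADS environment variable, a minimal sketch of capping the thread pools from inside the script with the threadpoolctl package (assumption: threadpoolctl is installed; it is the same utility scikit-learn uses internally to manage OpenMP/BLAS pools):

from threadpoolctl import threadpool_limits

# cap all detected OpenMP/BLAS thread pools at the number of physical cores (here assumed to be 4)
with threadpool_limits(limits=4):
    md.fit(X_train, y_train)

Setting OMP_NUM_THREADS before launching the process, as suggested above, remains the simpler option.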
Thanks @ogrisel for very quick answer.
Encoding: yeah, I know I could use ordinal or target encoding. In fact there was some discussion on this in 2015 on exactly this dataset; it seemed that ordinal encoding could actually get better AUC (in random forest) than 1-hot. The reason I did 1-hot in the benchmark, I guess, is that it has been (used to be?) the preferred method for practitioners and it was also the common denominator for all packages (e.g. I kept using 1-hot even with lightgbm in the benchmark, even after lightgbm started using "direct" encoding).
So maybe I should try ordinal encoding (and others), but then I should do the same with all the tools. Or at least try it out. But then of course there are so many other things I should also do (e.g. using more datasets of different structure, sparsity, etc.) to make the benchmark more meaningful. All I managed to do is create a list a while ago.
Thanks @ogrisel for suggestions, I just used the latest release, I'll try out the nightly build and also OMP_NUM_THREADS. Will add results here.
Keeping track of things:
sudo apt install python3-pip
sudo pip3 install -U pandas lightgbm sklearn
Data size in RAM:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
import sys
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
sys.getsizeof(d_all)/1e6
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])
sys.getsizeof(d_all)/1e6
X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()
X_all_cat.data.nbytes/1e6
X_all.data.nbytes/1e6
X_all_DENSE = X_all.toarray()
X_all_DENSE.nbytes/1e6
>>> sys.getsizeof(d_all)/1e6
88.390248
>>> sys.getsizeof(d_all)/1e6
26.000016
>>> X_all_cat.data.nbytes/1e6
9.6
>>> X_all.data.nbytes/1e6
12.8
>>> X_all_DENSE.nbytes/1e6
1102.4
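As a sanity check on the ~1.1 GB dense figure, the size follows directly from the shape of the expanded matrix (assuming float64, i.e. 8 bytes per value):

# rough check: dense size = rows * one-hot-expanded columns * 8 bytes (float64)
n_rows, n_cols = X_all_DENSE.shape
print(n_rows * n_cols * 8 / 1e6)  # should match X_all_DENSE.nbytes/1e6, ~1102 MB
# the sparse CSR only stores the ~12.8 MB of non-zero values (plus index arrays)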
I was too curious, here are the results with ordinal encoding:
import pandas as pd
import sklearn
from sklearn import preprocessing
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
vars_cat = ["Month", "DayofMonth", "DayOfWeek", "UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime", "Distance"]
input_all = vars_cat + vars_num
ordinal_encoder = preprocessing.OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
preprocessor = ColumnTransformer([
        ("cat", ordinal_encoder, vars_cat),
    ],
    remainder="passthrough",
)
X_train = preprocessor.fit_transform(d_train[vars_cat + vars_num])
y_train = (d_train["dep_delayed_15min"] == "Y").values
X_test = preprocessor.transform(d_test[vars_cat + vars_num])
y_test = (d_test["dep_delayed_15min"] == "Y").values
print(f"n_samples={X_train.shape[0]}")
print(f"n_features={X_train.shape[1]}")
print(f"LightGBM {lgb.__version__}:")
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(f" - training time: {time.time() - start_time:.3f}s")
y_pred = md.predict_proba(X_test)[:, 1]
print(f" - ROC AUC: {metrics.roc_auc_score(y_test, y_pred):.3f}")
print(f"scikit-learn {sklearn.__version__}:")
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1,
                                    max_iter=100)
start_time = time.time()
md.fit(X_train, y_train)
print(f" - training time: {time.time() - start_time:.3f}s")
y_pred = md.predict_proba(X_test)[:, 1]
print(f" - ROC AUC: {metrics.roc_auc_score(y_test, y_pred):.3f}")
I set: OMP_NUM_THREADS=4
(this dataset has too few features to really benefit from many threads and both lightgbm and scikit-learn suffer from over-subscription):
n_samples=100000
n_features=8
LightGBM 3.0.0:
- training time: 1.508s
- ROC AUC: 0.718
scikit-learn 0.24.dev0:
- training time: 3.487s
- ROC AUC: 0.718
So indeed LightGBM 3.0 is a bit more than 2x faster than scikit-learn master, but there is some variability on such short runs.
Nice @ogrisel, I was working on that as well, you got there first 👍
It's weird, then, that with 1-hot encoding HistGBT is 30x slower (see above). It's just another data matrix (though wider) and it has mostly 0s and 1s (though I use dense matrices, so lightgbm does not have knowledge of that either).
It's expected for high cardinality features: it has a lot more work to do to build the histograms for the expanded features and for each such feature we treat the majority of 0s as any other value. LightGBM is very smart at handling sparse features but this is not (yet) implemented in scikit-learn.
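A rough back-of-the-envelope for the extra work (assumption: ~690 one-hot columns versus 8 ordinal-encoded ones, consistent with the dense-size numbers earlier in the thread):

# per-node histogram building scales roughly with n_samples * n_features,
# so the one-hot layout implies on the order of
print(690 / 8)  # ~86x more histogram work per node, before any sparsity tricks or bundling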
Here's my version (with LabelEncoder, pardon me, I'm mainly an R guy LOL):
import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])
X_all = d_all[vars_num+vars_cat].values
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)
X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
Timings:
LGBMClassifier(num_leaves=512)
>>> print(time.time() - start_time)
1.4476277828216553
>>> print(metrics.roc_auc_score(y_test, y_pred))
0.7177775781298882
HistGradientBoostingClassifier(max_leaf_nodes=512)
>>> print(time.time() - start_time)
3.791149616241455
>>> print(metrics.roc_auc_score(y_test, y_pred))
0.7164761138428702
It's expected for high cardinality features: it has a lot more work to do to build the histograms for the expanded features and for each such feature we treat the majority of 0s as any other value. LightGBM is very smart at handling sparse features but this is not (yet) implemented in scikit-learn.
I used dense matrices with lightgbm as well, are you suggesting lightgbm might do a "preprocessing" step to lump the 0s into a bin and that does not need to be repeated over and over the iterations or something like that?
I used dense matrices with lightgbm as well, are you suggesting lightgbm might do a "preprocessing" step to lump the 0s into a bin and that does not need to be repeated over and over the iterations or something like that?
Indeed, I did not get that. LightGBM might be clever enough to automatically detect sparse patterns even in dense arrays. Furthermore, if it detects sparsity patterns it might also benefit from Exclusive Feature Bundling https://lightgbm.readthedocs.io/en/latest/Parameters.html#enable_bundle which we do not implement in scikit-learn at the moment.
LightGBM does feature bundling for features that are mutually exclusive, as is the case for OHE'd features.
Lol @ogrisel is a few seconds faster as usual
@szilard we have an internal sklearn.ensemble._hist_gradient_boosting.utils.get_equivalent_model in sklearn to make fairer comparisons between models (we deactivate feature bundling here), in case you're curious about various potential discrepancies.
This is mostly to make sure we get equivalent predictions; obviously using it for benchmark purposes would not be fair to LightGBM and others because we deactivate some advanced fancy stuff that isn't yet implemented in sklearn.
I am not sure that deactivating feature bundling can be considered "fair" for end-user-facing benchmarks. But it's useful for us to debug scikit-learn's performance and treat one problem at a time.
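One way to probe how much of the gap on dense 1-hot data is due to Exclusive Feature Bundling specifically is to rerun the LightGBM fit with bundling disabled. A sketch, under the assumption that extra keyword arguments on LGBMClassifier are forwarded to LightGBM as raw parameters (they are passed through as **kwargs):

# same dense one-hot setup as above, but with Exclusive Feature Bundling turned off
md = lgb.LGBMClassifier(num_leaves=128, learning_rate=0.1, n_estimators=100,
                        enable_bundle=False)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))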
Thanks @ogrisel and @NicolasHug for insights. I will look at a few things, will post all findings here.
Using nightly builds:
sudo apt install python3-pip
sudo pip3 install -U pandas lightgbm sklearn
sudo pip3 install -U --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn
Found existing installation: scikit-learn 0.23.2
Uninstalling scikit-learn-0.23.2:
Successfully uninstalled scikit-learn-0.23.2
Successfully installed scikit-learn-0.24.dev0
Running the original 1-hot encoding, dense matrices (both lightgbm and HistGBT):
import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])
X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)
X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]
X_train_DENSE = X_train.toarray()
X_test_DENSE = X_test.toarray()
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
Great, @ogrisel: the memory issue is fixed (looks like it was the memory leak you mentioned). Now the code above does not crash on 32 GB; memory usage is:
So HistGBT is still using some memory (vs lightgbm using very little), but now it's much better than before. Thanks @ogrisel and @amueller for suggesting to use the dev version (master/nightly builds).
Notwithstanding the other encoding options, if we look at 1-hot encoding:
Run time [sec] and AUC:
data size 100K:
lightgbm sparse
1.8356289863586426
0.7300012836184978
lightgbm dense
2.1029934883117676
0.7300012836184978
HistGBT dense
35.00905084609985
0.728212555455098
data size 1M:
lightgbm sparse
4.715407609939575
0.764772836574283
lightgbm dense
6.551331520080566
0.764772836574283
HistGBT dense
170.7247931957245
0.7655385730787526
data size 10M:
Cannot create the dense matrix with 32 GB RAM. Will need a bigger cloud instance. (TODO)
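A rough back-of-the-envelope on why 32 GB cannot hold the 10M dense matrix (assuming the one-hot expansion stays around ~700 columns, as in the smaller samples, and float64 storage):

# ~10.1M rows (train + test) * ~700 one-hot columns * 8 bytes per float64
print(10.1e6 * 700 * 8 / 1e9)  # roughly 57 GB for the dense matrix alone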
Code:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from scipy import sparse
from sklearn import metrics
import time
import lightgbm as lgb
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
d_train = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/train-10m.csv")
d_test = pd.read_csv("https://s3.amazonaws.com/benchm-ml--main/test.csv")
d_all = pd.concat([d_train,d_test])
vars_cat = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
vars_num = ["DepTime","Distance"]
for col in vars_cat:
    d_all[col] = preprocessing.LabelEncoder().fit_transform(d_all[col])
X_all_cat = preprocessing.OneHotEncoder(categories="auto").fit_transform(d_all[vars_cat])
X_all = sparse.hstack((X_all_cat, d_all[vars_num])).tocsr()
y_all = np.where(d_all["dep_delayed_15min"]=="Y",1,0)
X_train = X_all[0:d_train.shape[0],]
y_train = y_all[0:d_train.shape[0]]
X_test = X_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0]),]
y_test = y_all[d_train.shape[0]:(d_train.shape[0]+d_test.shape[0])]
X_train_DENSE = X_train.toarray()
X_test_DENSE = X_test.toarray()
print("lightgbm sparse")
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
print("lightgbm dense")
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
print("HistGBT dense")
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100)
start_time = time.time()
md.fit(X_train_DENSE, y_train)
print(time.time() - start_time)
y_pred = md.predict_proba(X_test_DENSE)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))
The Exclusive Feature Bundling feature of LightGBM is really efficient at dealing with OHE categorical variables. However, I still think that OHE is useless for decision-tree-based algorithms and that ordinal encoding (or native categorical variable support) is a better option.
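For reference, a hedged sketch of the native categorical route mentioned above, reusing the ordinal-encoded X_train from the ColumnTransformer example earlier (where the first len(vars_cat) columns are the encoded categoricals). It assumes scikit-learn >= 0.24 for the categorical_features parameter; note that the -1 codes produced for unseen test categories may need special handling depending on the version:

# scikit-learn: mark the ordinal-encoded columns as categorical (assumes sklearn >= 0.24)
cat_mask = [True] * len(vars_cat) + [False] * len(vars_num)
md = HistGradientBoostingClassifier(max_leaf_nodes=512, learning_rate=0.1, max_iter=100,
                                    categorical_features=cat_mask)
md.fit(X_train, y_train)

# LightGBM: point categorical_feature at the same columns (it also accepts pandas 'category' dtypes)
md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
md.fit(X_train, y_train, categorical_feature=list(range(len(vars_cat))))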
New tool https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html
based on POC https://github.com/ogrisel/pygbm mentioned earlier here https://github.com/szilard/GBM-perf/issues/15