szilard opened this issue 3 years ago (status: Open)
Setup for now (WIP):

```shell
sudo docker run --rm -ti -p 8787:8787 continuumio/anaconda3 /bin/bash
pip3 install -U lightgbm

## for Dask lightgbm, for now use this:
wget https://raw.githubusercontent.com/jameslamb/talks/main/recent-developments-in-lightgbm/Dockerfile
sudo docker build -t dasklgbm .
sudo docker run --rm -p 8787:8787 dasklgbm
sudo docker ps -a
sudo docker exec -ti ... /bin/bash
ipython
```
Plain lightgbm:

```python
import pandas as pd
import numpy as np
from sklearn import metrics
import lightgbm as lgb

d_train = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/train-1m-intenc.csv")
d_test = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/test-1m-intenc.csv")

X_train = d_train.iloc[:, :-1].to_numpy()
y_train = d_train.iloc[:, -1].to_numpy()   # 1-D label array
X_test = d_test.iloc[:, :-1].to_numpy()
y_test = d_test.iloc[:, -1].to_numpy()

md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
%time md.fit(X_train, y_train)

y_pred = md.predict_proba(X_test)[:, 1]
print(metrics.roc_auc_score(y_test, y_pred))
```
Results:

```
Wall time: 3.72 s
0.7636986921602019
```
Dask lightgbm:

```python
import pandas as pd
from sklearn import metrics
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
from lightgbm.dask import DaskLGBMClassifier

cluster = LocalCluster(n_workers=16, threads_per_worker=1)
client = Client(cluster)

d_train = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/train-1m-intenc.csv")
d_test = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/test-1m-intenc.csv")

dx_train = dd.from_pandas(d_train, npartitions=16)
dx_test = dd.from_pandas(d_test, npartitions=1)

X_train = dx_train.iloc[:, :-1].to_dask_array(lengths=True)
y_train = dx_train.iloc[:, -1:].to_dask_array(lengths=True)
X_test = dx_test.iloc[:, :-1].to_dask_array(lengths=True)
y_test = dx_test.iloc[:, -1:].to_dask_array(lengths=True)

# persist() returns new collections, so the results must be assigned back
X_train = X_train.persist()
y_train = y_train.persist()
client.has_what()

md = DaskLGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100, tree_learner="data", silent=False)
%time md.fit(client=client, X=X_train, y=y_train)

md_loc = md.to_local()
X_test_loc = X_test.compute()
y_pred = md_loc.predict_proba(X_test_loc)[:, 1]
print(metrics.roc_auc_score(y_test.compute(), y_pred))
```
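One subtlety worth calling out from the snippet above: `persist()` does not pin data in place on the collection it is called on; it returns a new collection backed by the materialized results, so the return value has to be assigned back. A minimal sketch with a toy `dask.array` (the array here is an illustration, not the benchmark data):

```python
import dask.array as da

# toy array standing in for X_train
x = da.ones((100, 4), chunks=(25, 4))

# persist() returns a NEW collection; `x` itself stays lazy
x_persisted = x.persist()

# both compute to the same values, but only x_persisted holds results in memory
print(float(x_persisted.sum().compute()))  # 400.0
```

With a distributed `Client` attached, `persist()` also triggers asynchronous computation on the workers, which is why `client.has_what()` can be used afterwards to see where the chunks landed.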
With 16 workers and 1 thread each, training is terribly slow. Filed a bug report here: https://github.com/microsoft/LightGBM/issues/3797#issuecomment-764566963
Changing number of workers, threads, partitions:

| n_workers | n_threads | n_partitions | Time (sec) | AUC |
|---|---|---|---|---|
| 16 | 1 | 16 | 15 minutes !!! | 0.652041 !!! |
| 1 | 16 | 16 | 4.7 | 0.763924 |
| 1 | 16 | 1 | 4.8 | 0.763698 |
| 1 | 1 | 1 | 13.8 | 0.763698 |
| 4 | 4 | 16 | 11.5 | 0.760917 |
| no dask | 16 | n/a | 3.7 | 0.763698 |
| no dask | 1 | n/a | 13 | 0.763698 |
10M rows:

```python
d_train = pd.read_csv("https://benchm-ml--int-enc.s3-us-west-2.amazonaws.com/train-10m-intenc.csv")
d_test = pd.read_csv("https://benchm-ml--int-enc.s3-us-west-2.amazonaws.com/test-10m-intenc.csv")
```

| n_workers | n_threads | n_partitions | Time (sec) | AUC |
|---|---|---|---|---|
| 16 | 1 | 16 | | |
| 1 | 16 | 16 | 28.5 | 0.773071 |
| 4 | 4 | 16 | 33.8 | 0.772040 |
| no dask | 16 | n/a | 21.9 | 0.773280 |
- m5.4xlarge 16c (8+8HT)
- 1M rows
- integer encoding for simplicity
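When picking the n_workers × threads_per_worker split, the logical core count the scheduler sees can be checked from Python; on an m5.4xlarge this reports 16 (8 physical cores plus 8 hyperthreads):

```python
import multiprocessing

# number of logical cores visible to the process
print(multiprocessing.cpu_count())
```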