szilard / GBM-perf

Performance of various open source GBM implementations
MIT License

Dask lightgbm #51

Open · szilard opened this issue 3 years ago

szilard commented 3 years ago

m5.4xlarge, 16c (8 physical cores + 8 hyperthreads)

1M rows

integer encoding for simplicity
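
(Note: "integer encoding" here means each categorical column has been mapped to integer codes; the *-intenc.csv files used below come pre-encoded. A toy sketch of the idea, with a made-up column:)

import pandas as pd

## toy illustration of integer encoding (column name is made up);
## the benchmark CSVs used below are already encoded like this
d = pd.DataFrame({"carrier": ["AA", "UA", "AA", "DL"]})
d["carrier"] = d["carrier"].astype("category").cat.codes   ## AA->0, DL->1, UA->2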

szilard commented 3 years ago

Setup for now (WIP):

## environment for plain lightgbm:
sudo docker run --rm -ti -p 8787:8787 continuumio/anaconda3 /bin/bash

pip3 install -U lightgbm

## for dask lightgbm, for now use this Dockerfile:
wget https://raw.githubusercontent.com/jameslamb/talks/main/recent-developments-in-lightgbm/Dockerfile
sudo docker build -t dasklgbm .
sudo docker run --rm -p 8787:8787 dasklgbm
sudo docker ps -a                     ## get the container ID
sudo docker exec -ti ... /bin/bash    ## substitute the container ID for ...
ipython
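
Quick sanity check inside ipython that the image has the expected packages (versions at the time may differ):

import lightgbm, dask, distributed
print(lightgbm.__version__, dask.__version__, distributed.__version__)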
szilard commented 3 years ago

Plain lightgbm:

import pandas as pd
import numpy as np
from sklearn import metrics

import lightgbm as lgb

d_train = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/train-1m-intenc.csv")
d_test = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/test-1m-intenc.csv")

## last column is the label, the rest are the (integer-encoded) features;
## iloc[:, -1] keeps the labels 1-D, avoiding sklearn's column-vector warning
X_train = d_train.iloc[:, :-1].to_numpy()
y_train = d_train.iloc[:, -1].to_numpy()
X_test = d_test.iloc[:, :-1].to_numpy()
y_test = d_test.iloc[:, -1].to_numpy()

md = lgb.LGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100)
%time md.fit(X_train, y_train)

y_pred = md.predict_proba(X_test)[:,1]
print(metrics.roc_auc_score(y_test, y_pred))

Results:

Wall time: 3.72 s
0.7636986921602019
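
(%time is an IPython magic; if running outside IPython, a plain-Python equivalent for the timing:)

import time

t0 = time.perf_counter()
md.fit(X_train, y_train)
print(f"fit time: {time.perf_counter() - t0:.2f} s")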
szilard commented 3 years ago

Dask lightgbm:

import pandas as pd
from sklearn import metrics

from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

from lightgbm.dask import DaskLGBMClassifier

cluster = LocalCluster(n_workers=16, threads_per_worker=1)
client = Client(cluster)

d_train = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/train-1m-intenc.csv")
d_test = pd.read_csv("https://raw.githubusercontent.com/szilard/benchm-ml--data/master/int_enc/test-1m-intenc.csv")

dx_train = dd.from_pandas(d_train, npartitions=16)   ## one partition per worker
dx_test = dd.from_pandas(d_test, npartitions=1)

X_train = dx_train.iloc[:, :-1].to_dask_array(lengths=True)
y_train = dx_train.iloc[:, -1].to_dask_array(lengths=True)
X_test = dx_test.iloc[:, :-1].to_dask_array(lengths=True)
y_test = dx_test.iloc[:, -1].to_dask_array(lengths=True)

## persist returns new collections; assign them back so the data stays in cluster memory
X_train = X_train.persist()
y_train = y_train.persist()

client.has_what()   ## check how the partitions are distributed across workers

md = DaskLGBMClassifier(num_leaves=512, learning_rate=0.1, n_estimators=100, tree_learner="data", silent=False)
%time md.fit(client=client, X=X_train, y=y_train)

md_loc = md.to_local()          ## extract a regular (non-dask) sklearn-style model
X_test_loc = X_test.compute()

y_pred = md_loc.predict_proba(X_test_loc)[:, 1]
print(metrics.roc_auc_score(y_test.compute(), y_pred))

With 16 workers and 1 thread each it's terribly slow. Filed a bug report here: https://github.com/microsoft/LightGBM/issues/3797#issuecomment-764566963
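
(Per the timings in the next comment, a single worker with all 16 threads sidesteps the slowdown on a single box; a sketch of that cluster setup:)

## 1 worker x 16 threads: the fastest dask setting found below on this machine
cluster = LocalCluster(n_workers=1, threads_per_worker=16)
client = Client(cluster)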

szilard commented 3 years ago

Changing number of workers, threads, partitions:

n_workers   n_threads   n_partitions   Time (sec)          AUC
16          1           16             ~900 (15 min) !!!   0.652041 !!!
1           16          16             4.7                 0.763924
1           16          1              4.8                 0.763698
1           1           1              13.8                0.763698
4           4           16             11.5                0.760917
no dask     16          -              3.7                 0.763698
no dask     1           -              13                  0.763698
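
(A sketch of the sweep behind the table above; the helper name is mine, not from the thread, and it just mirrors the code in the previous comment:)

import time

def run_config(n_workers, n_threads, n_partitions):
    ## hypothetical helper: time one cluster/partition configuration
    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=n_threads)
    client = Client(cluster)
    dx = dd.from_pandas(d_train, npartitions=n_partitions)
    X = dx.iloc[:, :-1].to_dask_array(lengths=True).persist()
    y = dx.iloc[:, -1].to_dask_array(lengths=True).persist()
    md = DaskLGBMClassifier(num_leaves=512, learning_rate=0.1,
                            n_estimators=100, tree_learner="data")
    t0 = time.perf_counter()
    md.fit(client=client, X=X, y=y)
    elapsed = time.perf_counter() - t0
    client.close()
    cluster.close()
    return elapsed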
szilard commented 3 years ago

10M rows:

d_train = pd.read_csv("https://benchm-ml--int-enc.s3-us-west-2.amazonaws.com/train-10m-intenc.csv")
d_test = pd.read_csv("https://benchm-ml--int-enc.s3-us-west-2.amazonaws.com/test-10m-intenc.csv")
n_workers   n_threads   n_partitions   Time (sec)   AUC
16          1           16
1           16          16             28.5         0.773071
4           4           16             33.8         0.772040
no dask     16          -              21.9         0.773280
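
(At 10M rows the pandas-load-then-from_pandas round trip gets expensive; dask can also read the CSV directly and do the partitioning itself. A sketch; blocksize is an assumption to tune:)

dx_train = dd.read_csv(
    "https://benchm-ml--int-enc.s3-us-west-2.amazonaws.com/train-10m-intenc.csv",
    blocksize="64MB",   ## target partition size (assumed value, tune to taste)
)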