microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

long training time for small size lightgbm (with 1GB training/test data size) model on AWS EC2 cluster with highly parallel jobs #2796

Closed jinxu-ml closed 4 years ago

jinxu-ml commented 4 years ago

This question is related to parallel training of a LightGBM regression model on all machines of a Databricks/AWS cluster, but here I show more code and details plus new questions, so I created a new issue.

I am trying to train a LightGBM model on an AWS EC2 cluster via Databricks. The total data size is 1 GB (training and test combined), with around 18 features.

I am using grid search to find the best hyperparameters for the LightGBM model.

My Python 3 code (only the relevant parts are shown):

import lightgbm as lgb
from sklearn.model_selection import ShuffleSplit, train_test_split, GridSearchCV

DRIVER_CORES = 8  # it can also be set to 16 or 32

grid_params = {
    'n_estimators': [1000],  # [2000, 4000]
}

# set parallel jobs
total_grid_points = 1
for k in grid_params:
    total_grid_points *= len(grid_params[k])
parallel_grid_search_jobs = min(DRIVER_CORES, total_grid_points)

params = {'objective': 'regression_l1',
          'boosting_type': 'goss',
          'metric': 'l1',
          'max_depth': -1,  # it is also set as 6, 8, 14, 20, 30
          'num_leaves': 8000,
          'learning_rate': 0.01,
          'bagging_fraction': 1.0,
          'bagging_freq': 1,
          'lambda_l1': 2.0,
          'lambda_l2': 2.0,
          'n_jobs': max(DRIVER_CORES // parallel_grid_search_jobs, 1),
          }

gbm_regressor = lgb.LGBMRegressor(objective=params['objective'],
                                  num_leaves=params['num_leaves'],
                                  boosting_type=params['boosting_type'],
                                  metric=params['metric'],
                                  bagging_fraction=params['bagging_fraction'],
                                  bagging_freq=params['bagging_freq'],
                                  learning_rate=params['learning_rate'],
                                  lambda_l1=params['lambda_l1'],
                                  lambda_l2=params['lambda_l2'],
                                  n_jobs=params['n_jobs'],
                                  warm_start=True,
                                  random_state=17,
                                  verbose=1,
                                  device='cpu',
                                  silent=False)

cv = ShuffleSplit(n_splits=10, train_size=0.7, test_size=0.3)

gbm = GridSearchCV(estimator=gbm_regressor, param_grid=grid_params, cv=cv,
                   scoring=['neg_mean_absolute_error'], n_jobs=DRIVER_CORES,
                   refit='neg_mean_absolute_error')

# this fitting takes a long time !!!
gbm.fit(X=X_train, y=y_train)  # the total data size for training and test is around 1 GB, with 18 features

My cluster setting:

driver: c4.8xlarge (36 cores, 60 GB memory)
1 master, 1 worker: r4.xlarge (4 cores, 30 GB memory)

After the job is submitted to the cluster, it only runs on the driver. It used up all CPUs no matter how many parallel jobs I created by setting n_jobs (8, 16, 32) on the driver node.

For n_jobs = 8, it took 17 minutes for gbm.fit(X=X_train, y=y_train) to finish.
For n_jobs = 16, it took 65 minutes for gbm.fit(X=X_train, y=y_train) to finish.
For n_jobs = 32, it took around 200 minutes for gbm.fit(X=X_train, y=y_train) to finish.

My questions:

Why does increasing the number of parallel jobs (n_jobs, all run on the driver node) make the LightGBM model fitting take longer?

The model size is small, the training data size is small, and the LightGBM model uses "goss". Why does the training take so long?

I have drawn the learning curves for the training and test datasets, which show that for small iteration counts (< 150) the test error is lower than the training error. How could this happen?

Why is 30 GB used on the driver node even though the total data size is 1 GB and n_jobs = 8? If 8 copies of the same dataset are created, why is 30 GB consumed?

guolinke commented 4 years ago

Did you try it without GridSearchCV? It seems there is an n_jobs in both lgb.LGBMRegressor and GridSearchCV. Since one LightGBM job will occupy n_jobs cores, if the n_jobs in GridSearchCV is set to DRIVER_CORES, the total thread count is DRIVER_CORES * n_jobs, which largely exceeds the cores in your machine.

For efficiency, the total number of threads should equal the number of CPU cores; otherwise, it will be much slower.
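
To illustrate the arithmetic, here is a rough sketch; the DRIVER_CORES = 36 figure is assumed from the c4.8xlarge driver, and "each fit grabs all cores" is an assumption for the case where the estimator's n_jobs is left unset:

# Rough sketch of the thread budget; numbers are illustrative, not measured.
DRIVER_CORES = 36                        # assumed: c4.8xlarge driver

# If GridSearchCV gets n_jobs=DRIVER_CORES and every LightGBM fit also uses
# all cores, the total thread count is the product of the two levels.
outer_jobs = DRIVER_CORES                # n_jobs passed to GridSearchCV
inner_threads = DRIVER_CORES             # threads each LightGBM fit may grab
print(outer_jobs * inner_threads)        # 1296 threads on 36 cores -> heavy oversubscription

# Keeping the total roughly equal to the core count instead:
inner_threads = 4                        # n_jobs set explicitly on LGBMRegressor
outer_jobs = DRIVER_CORES // inner_threads   # 9 concurrent fits
assert outer_jobs * inner_threads <= DRIVER_CORES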

BTW, if your machine is NUMA, LightGBM may be slow due to slow memory access across different CPU sockets.

For your other questions: firstly, I don't think num_trees=1000 with num_leaves=8000 is a small model. As for the training error versus test error, that mainly depends on your data.

The memory cost is not only for the dataset; there are buffers for histograms, scores, the model, etc. As your model size is large, the memory cost can also be large.
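As a very rough back-of-envelope (not an exact accounting of LightGBM's internal allocation; the 16-bytes-per-bin figure and the default max_bin = 255 are assumptions), the histogram buffers alone grow with num_leaves and with the number of concurrent fits:

# Back-of-envelope only; illustrates how num_leaves and concurrent fits
# multiply memory, not LightGBM's exact internal layout.
num_leaves = 8000
num_features = 18
max_bin = 255            # assumed LightGBM default
bytes_per_bin = 16       # assumed: gradient + hessian sums stored as doubles
concurrent_fits = 8      # n_jobs of GridSearchCV

per_fit = num_leaves * num_features * max_bin * bytes_per_bin
total = per_fit * concurrent_fits
print(f"{per_fit / 2**30:.2f} GiB per fit, {total / 2**30:.1f} GiB for {concurrent_fits} fits")
# ~0.55 GiB per fit, ~4.4 GiB for 8 concurrent fits -- before counting dataset
# copies, score arrays and the 1000-tree x 8000-leaf models themselves.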

jinxu-ml commented 4 years ago

Thanks for your reply. I have double-checked my code and found that I did not set n_jobs in LGBMRegressor; I only set it on GridSearchCV via DRIVER_CORES. So the total number of parallel jobs should be n_jobs, not n_jobs * DRIVER_CORES. Also, I have run the same code on my laptop (MacBook Pro, 8 cores, 16 GB memory) for the case n_jobs = 8, keeping all other parameters the same as on AWS. Its run time is around 10 minutes, which is faster than the 17 minutes on AWS/EC2 (c4.8xlarge, 36 cores, 60 GB memory). How could this happen? Thanks.

guolinke commented 4 years ago

If you don't set n_jobs for LightGBM, the behaviour is undefined. It could use all cores or 1 core, depending on your environment. It is better to set it explicitly.

jinxu-ml commented 4 years ago

I have updated the OP; the regressor's n_jobs is now set from DRIVER_CORES and the n_jobs of GridSearchCV. For example, if GridSearchCV has 2 x 2 grid points and DRIVER_CORES = 16, the regressor's n_jobs is 16/4 = 4. I have rerun it on the same cluster, but the run time is always about 20 minutes no matter how many DRIVER_CORES I use. Why is there no speed-up with more cores? The total data size is only 1 GB.

guolinke commented 4 years ago

Will cv = ShuffleSplit(n_splits=10, train_size=0.7, test_size=0.3) increase the number of jobs in GridSearchCV by a factor of 10? If the regressor's n_jobs is set to k, I suggest setting the n_jobs of GridSearchCV to DRIVER_CORES//k, not DRIVER_CORES.
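
A sketch of this suggested configuration (the k = 4 split of the 36 assumed driver cores is only an example; only the construction is shown, not the fit):

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Illustrative: 1 grid candidate x 10 shuffle splits = 10 fits that
# GridSearchCV can schedule in parallel, each a full LightGBM training job.
DRIVER_CORES = 36
k = 4                                    # threads per LightGBM fit (example value)

cv = ShuffleSplit(n_splits=10, train_size=0.7, test_size=0.3)
gbm = GridSearchCV(
    estimator=lgb.LGBMRegressor(objective='regression_l1', n_jobs=k),
    param_grid={'n_estimators': [1000]},
    cv=cv,
    scoring='neg_mean_absolute_error',
    n_jobs=DRIVER_CORES // k,            # outer workers, so total threads ~= DRIVER_CORES
    refit=True,
)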