Closed jinxu-ml closed 4 years ago
Did you try it without `GridSearchCV`? It seems there is an `n_jobs` parameter in both `lgb.LGBMRegressor` and `GridSearchCV`.
Since one LightGBM job will occupy `n_jobs` cores, if the `n_jobs` in `GridSearchCV` is set to `DRIVER_CORES`, the total number of threads is `DRIVER_CORES * n_jobs`, which far exceeds the cores in your machine.
For efficiency, the total number of threads should equal the number of CPU cores; otherwise it will be much slower.
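As a back-of-the-envelope sketch of that oversubscription (the value `DRIVER_CORES = 36` matches the c4.8xlarge mentioned in this thread; the per-fit thread count is an illustrative assumption, not the poster's actual setting):

```python
# Illustrative numbers only: 36 cores matches the c4.8xlarge in this thread.
DRIVER_CORES = 36                  # cores on the driver machine
lightgbm_n_jobs = 36               # threads each LightGBM fit spawns
gridsearch_n_jobs = DRIVER_CORES   # fits GridSearchCV runs concurrently

# Total threads competing for DRIVER_CORES cores:
total_threads = gridsearch_n_jobs * lightgbm_n_jobs
print(total_threads)  # 1296 threads fighting over 36 cores
```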
BTW, if your machine is NUMA, LightGBM may be slow due to slow memory access across different CPU sockets.
For your other questions:
Firstly, I don't think `num_trees=1000` with `num_leaves=8000` is a small model.
For the training error and test error, it mainly depends on your data.
The memory cost is not just for the dataset: there are also buffers for histograms, scores, the model, etc. As your model size is large, the memory cost could also be large.
Thanks for your reply. I have double-checked my code and found that I did not set `n_jobs` in `LGBMRegressor`; I set it on `GridSearchCV` to `DRIVER_CORES`. So the total number of parallel jobs should be `n_jobs`, not `n_jobs * DRIVER_CORES`. Also, I have run the same code on my laptop, a MacBook Pro with 8 cores and 16 GB memory, with `n_jobs=8` and all other parameters the same as the run on AWS. Its run time is around 10 minutes, which is faster than the 17 minutes on AWS/EC2 (c4.8xlarge: 36 cores, 60 GB memory). How could this happen? Thanks
If you don't set `n_jobs` for LightGBM, the behaviour is undefined. It could use all cores or 1 core, depending on your env. It is better to set it explicitly.
I have updated the OP: the regressor's `n_jobs` is now set from `DRIVER_CORES` and the `n_jobs` of `GridSearchCV`. For example, if `GridSearchCV` has a 2 x 2 grid of points and `DRIVER_CORES = 16`, the regressor's `n_jobs` is 16/4 = 4. I have rerun it on the same cluster, but the run time is always 20 mins no matter how many `DRIVER_CORES` I use. Why is there no speed-up with more cores? The total data size is only 1 GB.
Will `cv = ShuffleSplit(n_splits=10, train_size=0.7, test_size=0.3)` increase the jobs to 10 times in GridSearch?
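For context, a sketch of the bookkeeping (per scikit-learn's documented behaviour, with the 2 x 2 grid borrowed from the example elsewhere in this thread): `GridSearchCV` schedules one fit per (parameter combination, CV split) pair, so `n_splits=10` multiplies the number of fits, while `n_jobs` still caps how many run at once.

```python
from sklearn.model_selection import ShuffleSplit

# The splitter from the question above.
cv = ShuffleSplit(n_splits=10, train_size=0.7, test_size=0.3)

grid_points = 2 * 2                          # e.g. a 2 x 2 parameter grid
total_fits = grid_points * cv.get_n_splits()  # one fit per (params, split)
print(total_fits)  # 40 fits in total, but concurrency stays capped by n_jobs
```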
If `n_jobs` of the regressor is set to `k`, I suggest setting `n_jobs` of `GridSearchCV` to `DRIVER_CORES // k`, not `DRIVER_CORES`.
This question is related to parallel training of a LightGBM regression model on all machines of a Databricks/AWS cluster. But I show more code and details plus new questions, so I created a new issue.
I am trying to run LightGBM to do some machine learning model training on AWS/EC2 clusters via Databricks. The total data size is 1 GB (for training and test). There are around 18 features.
I am using grid search to find the best hyperparameters for the LightGBM model.
My Python 3 code (only the relevant parts shown):
My questions: