Closed kira-lin closed 1 year ago
We do not modify the inner workings of XGBoost, therefore that question should be directed towards its developers. On our side, each Actor converts their shard of data to pandas and passes it to the XGBoost model.
I manually set `nthread` in `param` when creating the `DMatrix` in `RayXGBoostActor`, and it solved my problem.
Interesting, can you elaborate on that?
Here is part of the flamegraph I captured from my workload: `GOMP_Parallel` appears in the stack, but I saw only 100% CPU utilization during this period, which made this step very slow. By the way, this `InitData` step is needed to convert the data format when using the `hist` tree method in XGBoost.
I then added `nthread` to `param` when creating the `DMatrix` in the function `_get_dmatrix` in `xgboost_ray/main.py`, and it worked:
```python
if isinstance(param["data"], list):
    dm_param = {
        "data": concat_dataframes(param["data"]),
        "label": concat_dataframes(param["label"]),
        "weight": concat_dataframes(param["weight"]),
        "qid": concat_dataframes(param["qid"]),
        "base_margin": concat_dataframes(param["base_margin"]),
        "label_lower_bound": concat_dataframes(param["label_lower_bound"]),
        "label_upper_bound": concat_dataframes(param["label_upper_bound"]),
    }
    param.update(dm_param)

ll = param.pop("label_lower_bound", None)
lu = param.pop("label_upper_bound", None)

if LEGACY_MATRIX:
    param.pop("base_margin", None)
if "qid" not in inspect.signature(xgb.DMatrix).parameters:
    param.pop("qid", None)

if data.enable_categorical is not None:
    param["enable_categorical"] = data.enable_categorical

param["nthread"] = 32  # I hard-coded nthread here
matrix = xgb.DMatrix(**param)
```
Thanks, this is very useful to know! Perhaps we should just set it ourselves here.
That's not expected. I think `QuantileDMatrix`/`DMatrix` uses all threads by default if `nthread` is not specified:
https://github.com/dmlc/xgboost/blob/ad0ccc6e4f65755c1a970151a51e07d580809de6/python-package/xgboost/core.py#L727
https://github.com/dmlc/xgboost/blob/ad0ccc6e4f65755c1a970151a51e07d580809de6/python-package/xgboost/core.py#L1383
It is possible it is defaulting to 1 inside the Ray actor. I don't think there would be any harm in setting it to the number of CPUs the actor has assigned.
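A minimal sketch of that idea. The helper name `resolve_nthread` is hypothetical, not part of xgboost_ray; the assumption being tested is that Ray constrains worker processes through the `OMP_NUM_THREADS` environment variable (which, if set to 1, would explain `DMatrix` construction running single-threaded):

```python
import multiprocessing
import os


def resolve_nthread(requested=None):
    """Pick a thread count for DMatrix construction (hypothetical helper).

    Preference order: an explicit request, then the CPU allowance the
    runtime exposes via OMP_NUM_THREADS, then all visible cores.
    """
    if requested is not None:
        return requested
    env_threads = os.environ.get("OMP_NUM_THREADS")
    if env_threads:
        return int(env_threads)
    return multiprocessing.cpu_count()


# The hard-coded line in the snippet above could then become:
# param["nthread"] = resolve_nthread()
```

This avoids hard-coding a value like 32 and lets an explicit `nthread` in the user's params still win.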
My script is here:
I checked the CPU utilization using top during training, and I noticed a long period, about 300 s, when utilization stays at only 100% after the training actors start; total training time is over 500 s. After that, utilization goes up to around 2000%. I used py-spy to dump the stack trace during the long single-core period and found that this function

`xgboost::tree::QuantileHistMaker::Builder::InitData`

took a long time to finish. I am not sure what this function is doing or why it is single-core, but it is now the bottleneck of my workload. Any help would be appreciated, thanks!