ray-project / xgboost_ray

Distributed XGBoost on Ray
Apache License 2.0

Long single-core period during training hurts performance #266

Closed kira-lin closed 1 year ago

kira-lin commented 1 year ago

My script is here:

import xgboost_ray

# dtrain is a RayDMatrix built earlier in the script
xgb_params = {
    'tree_method': 'hist',
    'grow_policy': 'depthwise'
}
cpus_per_actor = 32
bst = xgboost_ray.train(
    params=xgb_params,
    dtrain=dtrain,
    num_boost_round=100,
    ray_params=xgboost_ray.RayParams(
        max_actor_restarts=0,
        gpus_per_actor=0,
        cpus_per_actor=cpus_per_actor,
        num_actors=16
    )
)

I checked CPU utilization with top during training and noticed that, after the training actors start, there is a long period (about 300 s) during which CPU utilization stays at 100%, i.e. a single core; total training time is over 500 s. After that, utilization rises to around 2000%. I used py-spy to dump the stack trace during the single-core period and found that the function xgboost::tree::QuantileHistMaker::Builder::InitData takes a long time to finish.

I am not sure what this function does or why it runs on a single core, but it is currently the bottleneck of my workload. Any help would be appreciated, thanks.

Yard1 commented 1 year ago

We do not modify the inner workings of XGBoost, so that question should be directed to its developers. On our side, each actor converts its shard of the data to pandas and passes it to the XGBoost model.
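
For illustration, a minimal sketch of that per-actor flow (my own sketch, not xgboost_ray's actual code; build_dmatrix_from_shards is a made-up helper name):

import pandas as pd
import xgboost as xgb

def build_dmatrix_from_shards(data_shards, label_shards):
    # Each actor holds a list of pandas shards; concatenate them and
    # hand the result to XGBoost as a DMatrix.
    data = pd.concat(data_shards, ignore_index=True)
    label = pd.concat(label_shards, ignore_index=True)
    # Note: nthread is not set here, which is the situation discussed
    # in the rest of this thread.
    return xgb.DMatrix(data, label=label)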

kira-lin commented 1 year ago

I manually set nthread in the param dict when creating the DMatrix in RayXgboostActor, and it solved my problem.
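
In isolation, the change amounts to passing nthread when the DMatrix is constructed; a self-contained illustration with dummy data (the actual edit to xgboost_ray is pasted further down):

import numpy as np
import xgboost as xgb

X = np.random.rand(100_000, 20)
y = np.random.rand(100_000)

# Explicit thread count for DMatrix construction; in my run this removed
# the long single-core phase. 32 matches cpus_per_actor in my setup.
dtrain = xgb.DMatrix(X, label=y, nthread=32)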

Yard1 commented 1 year ago

Interesting, can you elaborate on that?

kira-lin commented 1 year ago

Here is part of the flamegraph I captured from my workload: [flamegraph image]. GOMP_Parallel appears in the stack, but I saw only 100% CPU utilization during this period, which made this step very slow. For context, InitData converts the data format when the hist tree method is used in XGBoost.

I then added nthread to the param dict when creating the DMatrix in the _get_dmatrix function in xgboost_ray/main.py, and it worked:

if isinstance(param["data"], list):
    dm_param = {
        "data": concat_dataframes(param["data"]),
        "label": concat_dataframes(param["label"]),
        "weight": concat_dataframes(param["weight"]),
        "qid": concat_dataframes(param["qid"]),
        "base_margin": concat_dataframes(param["base_margin"]),
        "label_lower_bound": concat_dataframes(
            param["label_lower_bound"]),
        "label_upper_bound": concat_dataframes(
            param["label_upper_bound"]),
    }
    param.update(dm_param)

ll = param.pop("label_lower_bound", None)
lu = param.pop("label_upper_bound", None)

if LEGACY_MATRIX:
    param.pop("base_margin", None)

if "qid" not in inspect.signature(xgb.DMatrix).parameters:
    param.pop("qid", None)

if data.enable_categorical is not None:
    param["enable_categorical"] = data.enable_categorical

param["nthread"] = 32  # I hard-coded nthread here
matrix = xgb.DMatrix(**param)

Yard1 commented 1 year ago

Thanks, this is very useful to know! Perhaps we should just set it ourselves here.
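
One possible shape for that (a sketch only, not an actual xgboost_ray patch; num_cpus_for_actor is a hypothetical argument that would have to be threaded through from RayParams.cpus_per_actor):

import xgboost as xgb

def build_dmatrix(param: dict, num_cpus_for_actor: int) -> xgb.DMatrix:
    # Default nthread to the actor's CPU allocation unless the caller
    # already set it, instead of hard-coding a value as in the snippet above.
    param.setdefault("nthread", num_cpus_for_actor)
    return xgb.DMatrix(**param)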

trivialfis commented 1 year ago

That's not expected. I think QuantileDMatrix/DMatrix uses all threads by default if nthread is not specified: https://github.com/dmlc/xgboost/blob/ad0ccc6e4f65755c1a970151a51e07d580809de6/python-package/xgboost/core.py#L727 https://github.com/dmlc/xgboost/blob/ad0ccc6e4f65755c1a970151a51e07d580809de6/python-package/xgboost/core.py#L1383
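
A quick way to sanity-check that claim outside of Ray (my own repro sketch, not from the thread) is to time DMatrix construction with the default versus an explicit nthread:

import time

import numpy as np
import xgboost as xgb

X = np.random.rand(1_000_000, 50)
y = np.random.rand(1_000_000)

start = time.perf_counter()
xgb.DMatrix(X, label=y)  # nthread left unset: resolved by XGBoost/OpenMP
print("default nthread:", time.perf_counter() - start)

start = time.perf_counter()
xgb.DMatrix(X, label=y, nthread=32)  # explicit thread count
print("nthread=32:", time.perf_counter() - start)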

Yard1 commented 1 year ago

It is possible it is defaulting to 1 in the Ray actor. I don't think there would be any harm in setting it to the number of CPUs assigned to the actor.
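
A small probe (again just a sketch, with ThreadEnvProbe a made-up name) can show what the threading defaults look like inside a Ray actor, assuming Ray exports OMP_NUM_THREADS into its worker processes, which is what an "use all threads" default may resolve against:

import os

import ray

ray.init(ignore_reinit_error=True)

@ray.remote(num_cpus=4)  # in practice, match cpus_per_actor
class ThreadEnvProbe:
    def report(self):
        # If OMP_NUM_THREADS is "1" here, single-threaded DMatrix
        # construction inside the actor would be expected.
        return {
            "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
            "os.cpu_count()": os.cpu_count(),
        }

probe = ThreadEnvProbe.remote()
print(ray.get(probe.report.remote()))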