ray-project / xgboost_ray

Distributed XGBoost on Ray
Apache License 2.0

Execution hangs with warning when Modin DataFrames are used as input data #127

Open prutskov opened 3 years ago

prutskov commented 3 years ago

The actors claim all of the cluster's resources. As a result, a warning occurs when Modin is used:

The warning message, after which execution hangs:

2021-06-25 08:37:32,245 INFO main.py:892 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
2021-06-25 08:37:46,124 WARNING worker.py:1114 -- The actor or task with ID 46766c5b0167bafaffffffffffffffffffffffff01000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {0.000000/112.000000 CPU, 227.353494 GiB/227.353494 GiB memory, 101.428594 GiB/101.428594 GiB object_store_memory, 0.000000/112.000000 CPU_group_eb4a2862d9a2cf9c4c2617c2763cd46f, 28.000000/28.000000 CPU_group_2_eb4a2862d9a2cf9c4c2617c2763cd46f, 28.000000/28.000000 CPU_group_3_eb4a2862d9a2cf9c4c2617c2763cd46f, 0.980000/1.000000 node:10.125.130.30, 28.000000/28.000000 CPU_group_0_eb4a2862d9a2cf9c4c2617c2763cd46f, 28.000000/28.000000 CPU_group_1_eb4a2862d9a2cf9c4c2617c2763cd46f}
. In total there are 112 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
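The numbers in the warning can be decoded by hand (a sketch based only on the log above; the bundle sizes are read off the `CPU_group_*` entries):

```python
# The node advertises 112 CPUs. The placement group created by XGBoost-Ray
# splits them into 4 bundles (CPU_group_0 .. CPU_group_3) of 28 CPUs each,
# one per training actor, so nothing is left for Modin's own Ray tasks.
bundles = 4          # actors created by XGBoost-Ray (one bundle per actor)
cpus_per_bundle = 28 # from the "28.000000/28.000000 CPU_group_*" entries
total_cpus = 112     # from "0.000000/112.000000 CPU" in the log

claimed = bundles * cpus_per_bundle
free_for_modin = total_cpus - claimed
print(free_for_modin)  # → 0, which is why Modin's 112 pending tasks cannot be scheduled
```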

Script to reproduce:

import time

import ray

import modin.pandas as pd
import xgboost_ray as xgb

ray.init()

def print_duration(start, message):
    duration = time.time() - start
    print(f'{message} time: {duration} s')

start = time.time()
data = pd.read_csv("../HIGGS.csv", header=None)
print_duration(start, 'read_csv')

start = time.time()
X, y = data.iloc[:, 1:], data.iloc[:,0]
X["label"] = y
print_duration(start, 'iloc')

start = time.time()
dmatrix = xgb.RayDMatrix(X, "label")
print_duration(start, 'dmatrix creation')

params = {
   "tree_method": "hist",
   "eta": 0.3,
   "eval_metric": ["logloss", "error"],
}
evals_result = dict()

start = time.time()
bst = xgb.train(
   params,
   dmatrix,
   num_boost_round=50,
   verbose_eval=10,
   evals=[(dmatrix, "train")],
   evals_result=evals_result)
print_duration(start, 'train')

start = time.time()
predictions = xgb.predict(bst, dmatrix)
print_duration(start, 'predict')

It uses the HIGGS dataset.

Package versions: Ray == 1.4, Modin == master, xgboost_ray == master

cc @krfricke

richardliaw commented 3 years ago

@Yard1 or @krfricke can you take a look at this?

Yard1 commented 3 years ago

I believe you need to ensure that not all CPUs are occupied by XGBoost-Ray actors, so that Modin's actors can do their job. In the examples we do this through RayParams. I'll check whether that is what needs to be done here.

Yard1 commented 3 years ago

@prutskov The issue is that Modin also uses Ray underneath - which means it uses the same actor pool. By default, XGBoost-Ray will schedule as many actors as you have cores available, which means no cores will be left for Modin. The fix is very simple:

bst = xgb.train(
   params,
   dmatrix,
   num_boost_round=50,
   verbose_eval=10,
   evals=[(dmatrix, "train")],
   evals_result=evals_result,
   ray_params=xgb.RayParams(num_actors=7)) # change to the number of cores you have minus 1

By explicitly telling XGBoost-Ray to spawn one fewer actor than you have CPU cores, Modin will be able to use the free core for its operations. Hope that helps!
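Rather than hard-coding `num_actors=7`, the value can be derived from the machine's core count (a sketch; `os.cpu_count()` is standard library, and reserving exactly one core follows the suggestion above):

```python
import os

# Leave one core free for Modin's Ray tasks; give the rest to XGBoost-Ray.
# os.cpu_count() may return None in exotic environments, and max(1, ...)
# guards against single-core machines.
cores = os.cpu_count() or 2
num_actors = max(1, cores - 1)

# The value would then be passed to xgb.train as
#   ray_params=xgb.RayParams(num_actors=num_actors)
print(num_actors)
```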