xgboost_ray on K8 breaks when running the HIGGS dataset example

vecorro commented 3 years ago

Hi,

My code runs properly when Ray runs on a single machine, but when I run it against Ray 2.0 deployed on K8s, execution breaks at the point shown in the error message below. The Ray cluster itself seems to be working well, since other example scripts run properly on it. From the dashboard, it looks like the head node never tries to load the dataset. I'd appreciate your guidance in finding and fixing the problem.

import ray
import ray.util
import os
import time
from xgboost_ray import train, RayDMatrix, RayParams
import pandas as pd

ray.util.connect("127.0.0.1:10001") # replace with the appropriate IP and port numbers

{'num_clients': 1,
 'python_version': '3.7.7',
 'ray_version': '2.0.0.dev0',
 'ray_commit': 'b0a813baad0a4a187f0cc25e11498843aff899c6',
 'protocol_version': '2021-04-09'}

colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]

df = pd.read_csv("../academy/academy-main/HIGGS.csv", names=colnames)

dtrain = RayDMatrix(df, label="label", distributed=False)

config = {
    "tree_method": "hist",
    "eval_metric": ["logloss", "error"],
}

evals_result = {}
#ray.init(object_store_memory=34359738368)

start = time.time()
bst = train(
    config,
    dtrain,
    evals_result=evals_result,
    ray_params=RayParams(max_actor_restarts=1),
    num_boost_round=100,
    evals=[(dtrain, "train")])
taken = time.time() - start
print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")

bst.save_model("higgs.xgb")
print("Final training error: {:.4f}".format(
    evals_result["train"]["error"][-1]))

The node with node id: 3d09a2cf73e5497816818576f38466d9989b6d2f166896cc9ceccc3f and ip: 10.1.0.71 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.

_InactiveRpcError                         Traceback (most recent call last)
~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/worker.py in _call_schedule_for_task(self, task)
    307         try:
--> 308             ticket = self.server.Schedule(task, metadata=self.metadata)
    309         except grpc.RpcError as e:

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/grpc/_channel.py in __call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
    824         state, call, = self._blocking(request, timeout, metadata, credentials,
--> 825                                       wait_for_ready, compression)
    826         return _end_unary_response_blocking(state, call, False, None)

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/grpc/_channel.py in _blocking(self, request, timeout, metadata, credentials, wait_for_ready, compression)
    803         if state is None:
--> 804             raise rendezvous  # pylint: disable-msg=raising-bad-type
    805         else:

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "Exception serializing request!"
    debug_error_string = "None"
>

During handling of the above exception, another exception occurred:

Error                                     Traceback (most recent call last)
in
      8     ray_params=RayParams(max_actor_restarts=1),
      9     num_boost_round=100,
---> 10     evals=[(dtrain, "train")])
     11 taken = time.time() - start
     12 print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py in train(params, dtrain, num_boost_round, evals, evals_result, additional_results, ray_params, _remote, *args, **kwargs)
   1045                 ray_params=ray_params,
   1046                 _remote=False,
-> 1047                 **kwargs,
   1048             ))
   1049         if isinstance(evals_result, dict):

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/remote_function.py in _remote_proxy(*args, **kwargs)
    102         @wraps(function)
    103         def _remote_proxy(*args, **kwargs):
--> 104             return self._remote(args=args, kwargs=kwargs)
    105
    106         self.remote = _remote_proxy

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/remote_function.py in _remote(self, args, kwargs, num_returns, num_cpus, num_gpus, memory, object_store_memory, accelerator_type, resources, max_retries, placement_group, placement_group_bundle_index, placement_group_capture_child_tasks, runtime_env, override_environment_variables, name)
    207                 runtime_env=runtime_env,
    208                 override_environment_variables=override_environment_variables,
--> 209                 name=name)
    210
    211         worker = ray.worker.global_worker

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/_private/client_mode_hook.py in client_mode_convert_function(func_cls, in_args, in_kwargs, **kwargs)
     85         setattr(func_cls, RAY_CLIENT_MODE_ATTR, key)
     86     client_func = ray._get_converted(key)
---> 87     return client_func._remote(in_args, in_kwargs, **kwargs)
     88
     89

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/common.py in _remote(self, args, kwargs, **option_args)
    107         if kwargs is None:
    108             kwargs = {}
--> 109         return self.options(**option_args).remote(*args, **kwargs)
    110
    111     def __repr__(self):

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/common.py in remote(self, *args, **kwargs)
    284
    285     def remote(self, *args, **kwargs):
--> 286         return return_refs(ray.call_remote(self, *args, **kwargs))
    287
    288     def __getattr__(self, key):

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/api.py in call_remote(self, instance, *args, **kwargs)
     94             kwargs: opaque keyword arguments
     95         """
---> 96         return self.worker.call_remote(instance, *args, **kwargs)
     97
     98     def call_release(self, id: bytes) -> None:

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/worker.py in call_remote(self, instance, *args, **kwargs)
    299         for k, v in kwargs.items():
    300             task.kwargs[k].CopyFrom(convert_to_arg(v, self._client_id))
--> 301         return self._call_schedule_for_task(task)
    302
    303     def _call_schedule_for_task(

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/worker.py in _call_schedule_for_task(self, task)
    308             ticket = self.server.Schedule(task, metadata=self.metadata)
    309         except grpc.RpcError as e:
--> 310             raise decode_exception(e.details())
    311
    312         if not ticket.valid:

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/util/client/worker.py in decode_exception(data)
    520
    521 def decode_exception(data) -> Exception:
--> 522     data = base64.standard_b64decode(data)
    523     return loads_from_server(data)

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/base64.py in standard_b64decode(s)
    103     are discarded prior to the padding check.
    104     """
--> 105     return b64decode(s)
    106
    107

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/base64.py in b64decode(s, altchars, validate)
     85     if validate and not re.fullmatch(b'[A-Za-z0-9+/]*={0,2}', s):
     86         raise binascii.Error('Non-base64 digit found')
---> 87     return binascii.a2b_base64(s)
     88
     89

Error: Incorrect padding

The node with node id: e52c10ddca43f8a398d222e6dec353457bb48faaaf3ff7ae72303e3e and ip: 10.1.0.69 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
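
One possible, unconfirmed reading of the "Exception serializing request!" status: because df is loaded on the client and passed with distributed=False, the whole DataFrame may have to be serialized into the request that the Ray client sends to the cluster. The sketch below only estimates how large that payload could be. It assumes the full HIGGS file (11,000,000 rows), that pandas parses all 29 columns as float64, and that the 2 GiB protobuf message ceiling is the relevant limit here. None of this is confirmed in this thread, and estimate_frame_bytes is a hypothetical helper, not part of Ray or xgboost_ray.

```python
# Rough size estimate for a dense, all-float64 DataFrame.
# Assumptions (not confirmed by the thread): 11,000,000 rows in HIGGS.csv,
# 29 float64 columns, and a ~2 GiB ceiling on a single serialized message.

BYTES_PER_FLOAT64 = 8
ASSUMED_MESSAGE_LIMIT = 2**31 - 1  # protobuf's per-message size ceiling


def estimate_frame_bytes(n_rows: int, n_cols: int,
                         bytes_per_value: int = BYTES_PER_FLOAT64) -> int:
    """Return the raw payload size of an n_rows x n_cols numeric table."""
    return n_rows * n_cols * bytes_per_value


higgs_bytes = estimate_frame_bytes(n_rows=11_000_000, n_cols=29)
exceeds_limit = higgs_bytes > ASSUMED_MESSAGE_LIMIT

print(f"Estimated payload: {higgs_bytes / 2**30:.2f} GiB")
print(f"Exceeds assumed limit: {exceeds_limit}")
```

If the estimate does exceed the limit, one thing to try would be training on a small sample of the CSV (for example by passing nrows to pd.read_csv) to see whether the same error still appears.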

richardliaw commented 3 years ago

@vecorro are you still running into this issue? Could you try running on the latest version of Ray?

vecorro commented 3 years ago

Thanks, @richardliaw. My main interest is in Ray + K8s, so I wonder whether I should try Ray 2.0.0.dev0 or Ray 1.3. Can you please provide guidance on that?

Enrique

richardliaw commented 3 years ago

Can you first try 1.3? Thanks!

richardliaw commented 3 years ago

Please make sure you have 1.3 on both the server side and the client side too.

vecorro commented 3 years ago

Sure thing. Actually, this is what I normally do:

- ensure the same version and build number at both ends
- ensure the same pip package version numbers at both ends
- try with both multi-node and single-node K8s clusters

I'll let you know how it goes.
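
The client/server version check discussed above can be sketched as a small comparison against the connection info shown earlier in this thread (the dictionary with 'ray_version' and 'python_version' printed after ray.util.connect). This is a minimal sketch: find_version_mismatches is a hypothetical helper, and it assumes that dictionary describes the server side.

```python
from typing import Dict, List


def find_version_mismatches(server_info: Dict[str, object],
                            client_ray_version: str,
                            client_python_version: str) -> List[str]:
    """Compare client versions with the info reported by the cluster.

    server_info is assumed to look like the dict printed after
    ray.util.connect(...) in this thread.
    """
    mismatches = []
    checks = {
        "ray_version": client_ray_version,
        "python_version": client_python_version,
    }
    for key, client_value in checks.items():
        server_value = server_info.get(key)
        if server_value != client_value:
            mismatches.append(
                f"{key}: server={server_value!r}, client={client_value!r}")
    return mismatches


# Connection info copied from the report above.
reported = {
    "num_clients": 1,
    "python_version": "3.7.7",
    "ray_version": "2.0.0.dev0",
    "ray_commit": "b0a813baad0a4a187f0cc25e11498843aff899c6",
    "protocol_version": "2021-04-09",
}

# Hypothetical client running Ray 1.3.0 on the same Python version.
problems = find_version_mismatches(reported, "1.3.0", "3.7.7")
for line in problems:
    print(line)
```

On a real client, the last two arguments would come from ray.__version__ and platform.python_version(). An empty list means both values match what the cluster reported.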

krfricke commented 3 years ago

Hi @vecorro, do you have an update on this? Does this work for you with Ray 1.3?

krfricke commented 3 years ago

Closing this for now, feel free to reopen if problems persist.

ray-project / xgboost_ray

xgboost_ray on K8 breaks when running the HIGGS dataset example #88