ray-project / xgboost_ray

Distributed XGBoost on Ray

HIGGS example not working #81

Closed vecorro closed 3 years ago

vecorro commented 3 years ago

Hi,

I'm trying to reproduce the HIGGS dataset example, but when I start training I get the messages below: only 1 out of 4 actors gets assigned any data or training work, and even that one actor eventually dies. I'm using Python 3.7.10 on macOS, Ray 1.2.0, and xgboost_ray 0.0.4.

Your help would be appreciated. Thanks!

2021-04-13 18:01:45,108 INFO main.py:791 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
2021-04-13 18:01:51,423 ERROR worker.py:1053 -- Possible unhandled error from worker: ray::RayXGBoostActor.load_data() (pid=71384, ip=192.168.0.164)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py", line 427, in load_data
    self._local_n = len(param["data"])
TypeError: object of type 'NoneType' has no len()
[The same TypeError traceback from ray::RayXGBoostActor.load_data() repeats five more times across pids 71377, 71379, and 71384.]
2021-04-13 18:02:15,172 INFO main.py:823 -- Waiting until actors are ready (30 seconds passed).
2021-04-13 18:02:45,187 INFO main.py:823 -- Waiting until actors are ready (60 seconds passed).
2021-04-13 18:03:10,454 ERROR worker.py:1053 -- Possible unhandled error from worker: ray::RayXGBoostActor.load_data() (pid=71383, ip=192.168.0.164)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'label'

The above exception was the direct cause of the following exception:

ray::RayXGBoostActor.load_data() (pid=71383, ip=192.168.0.164)
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py", line 423, in load_data
    param = data.get_data(self.rank, self.num_actors)
  File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/matrix.py", line 707, in get_data
    self.load_data(num_actors=num_actors, rank=rank)
  File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/matrix.py", line 694, in load_data
    self.num_actors, self.sharding, rank=rank)
  File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/matrix.py", line 467, in load_data
    local_df, data_source=data_source)
  File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/matrix.py", line 199, in _split_dataframe
    label, exclude = data_source.get_column(local_data, self.label)
  File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/data_sources/data_source.py", line 106, in get_column
    return data[column], column
  File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/kike/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'label'
2021-04-13 18:03:15,224 INFO main.py:823 -- Waiting until actors are ready (90 seconds passed).
2021-04-13 18:03:45,276 INFO main.py:823 -- Waiting until actors are ready (120 seconds passed).
2021-04-13 18:04:15,352 INFO main.py:823 -- Waiting until actors are ready (150 seconds passed).
2021-04-13 18:04:22,525 INFO main.py:834 -- [RayXGBoost] Starting XGBoost training.
2021-04-13 18:05:35,648 INFO elastic.py:155 -- Actor status: 4 alive, 0 dead (4 total)
2021-04-13 18:05:40,344 ERROR worker.py:1053 -- Possible unhandled error from worker: ray::RayXGBoostActor.train() (pid=71383, ip=192.168.0.164)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'label'
richardliaw commented 3 years ago

Hey @vecorro can you provide a reproducible example?

vecorro commented 3 years ago

Sure, thanks for responding. Note that I downloaded and decompressed the HIGGS dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
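
For completeness, a minimal sketch of that download-and-decompress step (standard library only; the local file names are assumptions matching the example below):

import gzip
import shutil
import urllib.request

# Fetch the compressed dataset (a multi-GB download) and decompress it
# to the HIGGS.csv file the example expects.
url = ("https://archive.ics.uci.edu/ml/"
       "machine-learning-databases/00280/HIGGS.csv.gz")
urllib.request.urlretrieve(url, "HIGGS.csv.gz")

with gzip.open("HIGGS.csv.gz", "rb") as fin, open("HIGGS.csv", "wb") as fout:
    shutil.copyfileobj(fin, fout)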

I'm using macOS 11.2.3, Ray 1.2.0, and xgboost-ray 0.0.4.

The code comes from the example hosted in the ray-project GitHub repo: https://github.com/ray-project/xgboost_ray/blob/master/examples/higgs.py

And the resulting errors are the ones I initially reported.


import os
import time
import ray
from xgboost_ray import train, RayDMatrix, RayParams

ray.init()

colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]

dtrain = RayDMatrix("HIGGS.csv", label="label", names=colnames)

config = {
    "tree_method": "hist",
    "eval_metric": ["logloss", "error"],
}

evals_result = {}

dtrain.feature_names  # bare expression; only displays output in a notebook, a no-op in a script

start = time.time()
bst = train(
    config,
    dtrain,
    evals_result=evals_result,
    ray_params=RayParams(max_actor_restarts=1),
    num_boost_round=100,
    evals=[(dtrain, "train")])
taken = time.time() - start
print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")

bst.save_model("higgs.xgb")
print("Final training error: {:.4f}".format(
    evals_result["train"]["error"][-1]))
krfricke commented 3 years ago

I could reproduce it and proposed a fix in #83.

The reason this happens is that the CSV string was detected as a distributed loadable data source. However, since this is only one file, only one actor was assigned the file. Explicitly passing distributed=False would have worked around this, but it's probably better to treat a single CSV reference as a non-distributed loadable data source.

Additionally, loading would fail because the kwargs were not passed through.
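
For reference, a minimal sketch of that workaround (the call shape mirrors the example above; note that before the fix in #83, the missing kwargs would still break loading):

dtrain = RayDMatrix(
    "HIGGS.csv",
    label="label",
    names=colnames,      # colnames as defined in the example above
    distributed=False,   # treat the single CSV as a non-distributed source
)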

This is actually related to #12: if we had the warning/error in place, this would have been detected much more easily. I'll take care of that next.

vecorro commented 3 years ago

Thanks @krfricke

If I use distributed=False to load the file, I eventually get a KeyError: it looks like the column names I provide in dtrain = RayDMatrix("HIGGS.csv", label="label", distributed=False, names=colnames) don't get passed to the downstream instructions. I wonder whether the whole program executed properly for you when you reproduced the initially reported error.

I modified the code to load the CSV file into a pandas DataFrame, and now I'm facing a different error. The bottom line is that I wonder what configuration the author of the example at https://github.com/ray-project/xgboost_ray/blob/master/examples/higgs.py used to make it work.

Here is the new code, followed by the error, which seems to be related to resource allocation. My computer has 16 cores and 64 GB of RAM. Before executing the script, Ray isn't running and RAM usage is below 30% (40 GB of memory are free). The HIGGS dataset is only around 8 GB.


import os
import time
import ray
from xgboost_ray import train, RayDMatrix, RayParams
import pandas as pd

ray.init()

colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]

df = pd.read_csv("HIGGS.csv", names=colnames)

dtrain = RayDMatrix(df, label="label")

config = {
    "tree_method": "hist",
    "eval_metric": ["logloss", "error"],
}

evals_result = {}

start = time.time()
bst = train(
    config,
    dtrain,
    evals_result=evals_result,
    ray_params=RayParams(max_actor_restarts=1),
    num_boost_round=100,
    evals=[(dtrain, "train")])
taken = time.time() - start
print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")

bst.save_model("higgs.xgb")
print("Final training error: {:.4f}".format(
    evals_result["train"]["error"][-1]))

2021-04-14 10:09:13,022 INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265
2021-04-14 10:10:39,945 INFO main.py:791 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
2021-04-14 10:10:59,309 INFO main.py:834 -- [RayXGBoost] Starting XGBoost training.

---------------------------------------------------------------------------
ObjectStoreFullError                      Traceback (most recent call last)
<ipython-input-1-37ad17e7cfb5> in <module>
     27     ray_params=RayParams(max_actor_restarts=1),
     28     num_boost_round=100,
---> 29     evals=[(dtrain, "train")])
     30 taken = time.time() - start
     31 print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py in train(params, dtrain, num_boost_round, evals, evals_result, additional_results, ray_params, _remote, *args, **kwargs)
   1192                 gpus_per_actor=gpus_per_actor,
   1193                 _training_state=training_state,
-> 1194                 **kwargs)
   1195             if training_state.training_started_at > 0.:
   1196                 total_training_time += time.time(

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py in _train(params, dtrain, evals, ray_params, cpus_per_actor, gpus_per_actor, _training_state, *args, **kwargs)
    865     training_futures = [
    866         actor.train.remote(rabit_args, params, dtrain, evals, *args, **kwargs)
--> 867         for actor in _training_state.actors if actor is not None
    868     ]
    869 

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py in <listcomp>(.0)
    865     training_futures = [
    866         actor.train.remote(rabit_args, params, dtrain, evals, *args, **kwargs)
--> 867         for actor in _training_state.actors if actor is not None
    868     ]
    869 

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/actor.py in remote(self, *args, **kwargs)
    107 
    108     def remote(self, *args, **kwargs):
--> 109         return self._remote(args, kwargs)
    110 
    111     def options(self, **options):

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/actor.py in _remote(self, args, kwargs, name, num_returns)
    148             invocation = self._decorator(invocation)
    149 
--> 150         return invocation(args, kwargs)
    151 
    152     def __getstate__(self):

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/actor.py in invocation(args, kwargs)
    142                 kwargs=kwargs,
    143                 name=name,
--> 144                 num_returns=num_returns)
    145 
    146         # Apply the decorator if there is one.

~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/actor.py in _actor_method_call(self, method_name, args, kwargs, name, num_returns)
    823         object_refs = worker.core_worker.submit_actor_task(
    824             self._ray_actor_language, self._ray_actor_id, function_descriptor,
--> 825             list_args, name, num_returns, self._ray_actor_method_cpus)
    826 
    827         if len(object_refs) == 1:

python/ray/_raylet.pyx in ray._raylet.CoreWorker.submit_actor_task()

python/ray/_raylet.pyx in ray._raylet.CoreWorker.submit_actor_task()

python/ray/_raylet.pyx in ray._raylet.prepare_args()

python/ray/_raylet.pyx in ray._raylet.CoreWorker.put_serialized_object()

python/ray/_raylet.pyx in ray._raylet.CoreWorker._create_put_buffer()

python/ray/_raylet.pyx in ray._raylet.check_status()

ObjectStoreFullError: Failed to put object ffffffffffffffffffffffffffffffffffffffff0100000022000000 in object store because it is full. Object size is 2552004917 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
vecorro commented 3 years ago

I finally got it to work: I used ray.init(object_store_memory=34359738368) to increase the object store memory. I think the pydoc description of object_store_memory might not be accurate, or is at least open to misinterpretation:

object_store_memory: The amount of memory (in bytes) to start the object store with. By default, this is automatically set based on available system memory.
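For readability, the same setting written as an expression (34359738368 bytes is exactly 32 GiB):

import ray

ray.init(object_store_memory=32 * 1024**3)  # 34359738368 bytes = 32 GiB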

krfricke commented 3 years ago

For the CSV file, you'll have to install the patch from #83, as that will take care of the KeyError as well.

For the memory error, the data is sharded into 4 shards (one per worker), and each piece seems to be around 2552004917 bytes, so about 2.5 GB per shard, or 10 GB total. Can you run ray memory when this error comes up? It would be good to see how large your object store currently is.

By default, the object store memory is initialized to 30% of the available node memory (see here). In your case this should be enough, but you might just want to set it a little higher initially.
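A rough back-of-the-envelope check of those numbers (assuming four equally sized shards and the 30% default on the 64 GB node mentioned earlier):

shard_bytes = 2_552_004_917      # object size from the ObjectStoreFullError
total_bytes = 4 * shard_bytes    # ~10.2 GB across the four actors

# Default object store: 30% of node memory, ~19.2 GB on a 64 GB node;
# enough for the shards, but with limited headroom for other objects.
default_store_bytes = 0.3 * 64 * 10**9

print(f"shards: {total_bytes / 1e9:.1f} GB, "
      f"default store: {default_store_bytes / 1e9:.1f} GB")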

You can set the object store memory with ray.init(object_store_memory=x), or on the command line with ray start --head --object-store-memory x.

krfricke commented 3 years ago

Ah great, glad you got it to work!

vecorro commented 3 years ago

Thanks very much.