Hey @vecorro, can you provide a reproducible example?
Sure, thanks for responding. Please note that I downloaded and decompressed the HIGGS dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
I'm using macOS 11.2.3, Ray 1.2.0 and xgboost-ray 0.0.4.
The code comes from the example hosted in the ray-project GitHub repo: https://github.com/ray-project/xgboost_ray/blob/master/examples/higgs.py
And the resulting errors are the ones I initially reported.
import os
import time
import ray
from xgboost_ray import train, RayDMatrix, RayParams
ray.init()
colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]
dtrain = RayDMatrix("HIGGS.csv", label="label", names=colnames)
config = {
"tree_method": "hist",
"eval_metric": ["logloss", "error"],
}
evals_result = {}
dtrain.feature_names
start = time.time()
bst = train(
config,
dtrain,
evals_result=evals_result,
ray_params=RayParams(max_actor_restarts=1),
num_boost_round=100,
evals=[(dtrain, "train")])
taken = time.time() - start
print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")
bst.save_model("higgs.xgb")
print("Final training error: {:.4f}".format(
evals_result["train"]["error"][-1]))
I could reproduce it and proposed a fix in #83.
The reason this happens is that the CSV string was detected as a distributed loadable data source. However, since this is only one file, only one actor would be assigned any data. Explicitly passing distributed=False would have fixed this, but it's probably better to treat a single CSV reference as a non-distributed loadable data source. Additionally, loading would fail because the kwargs were not passed through.
This is actually related to #12 - if we had the warning/error in place, this would have been detected much more easily. I'll take care of that next.
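For reference, a minimal sketch of the workaround described above, assuming the patch from #83 is installed so that the names kwarg actually reaches the CSV loader:

from xgboost_ray import RayDMatrix

# Column names as in the reproduction script above.
colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]

# A single CSV file: load it centrally instead of treating the path as a
# distributed data source, and pass the column names explicitly.
dtrain = RayDMatrix(
    "HIGGS.csv",
    label="label",
    names=colnames,
    distributed=False,
)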
Thanks @krfricke
If I use distributed=False to load the file, I eventually get a KeyError, as it looks like the column names I provide in dtrain = RayDMatrix("HIGGS.csv", label="label", distributed=False, names=colnames) don't get passed to the downstream instructions. I wonder whether the whole program executed properly for you when you reproduced the initially reported error.
I modified the code to load the CSV file into a Pandas DataFrame, and now I'm facing a different error. The bottom line is that I wonder what configuration the person who published the example https://github.com/ray-project/xgboost_ray/blob/master/examples/higgs.py used to make it work.
Here is the new code, followed by an error that seems to be related to resource allocation. My computer has 16 cores and 64 GB of RAM. Before executing the script, Ray isn't running and RAM usage is below 30% (about 40 GB of memory are free). The HIGGS dataset is only around 8 GB.
import os
import time
import ray
from xgboost_ray import train, RayDMatrix, RayParams
import pandas as pd
ray.init()
colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]
df = pd.read_csv("HIGGS.csv", names=colnames)
dtrain = RayDMatrix(df, label="label",)
config = {
"tree_method": "hist",
"eval_metric": ["logloss", "error"],
}
evals_result = {}
start = time.time()
bst = train(
config,
dtrain,
evals_result=evals_result,
ray_params=RayParams(max_actor_restarts=1),
num_boost_round=100,
evals=[(dtrain, "train")])
taken = time.time() - start
print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")
bst.save_model("higgs.xgb")
print("Final training error: {:.4f}".format(
evals_result["train"]["error"][-1]))
2021-04-14 10:09:13,022 INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265
2021-04-14 10:10:39,945 INFO main.py:791 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
2021-04-14 10:10:59,309 INFO main.py:834 -- [RayXGBoost] Starting XGBoost training.
---------------------------------------------------------------------------
ObjectStoreFullError Traceback (most recent call last)
<ipython-input-1-37ad17e7cfb5> in <module>
27 ray_params=RayParams(max_actor_restarts=1),
28 num_boost_round=100,
---> 29 evals=[(dtrain, "train")])
30 taken = time.time() - start
31 print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")
~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py in train(params, dtrain, num_boost_round, evals, evals_result, additional_results, ray_params, _remote, *args, **kwargs)
1192 gpus_per_actor=gpus_per_actor,
1193 _training_state=training_state,
-> 1194 **kwargs)
1195 if training_state.training_started_at > 0.:
1196 total_training_time += time.time(
~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py in _train(params, dtrain, evals, ray_params, cpus_per_actor, gpus_per_actor, _training_state, *args, **kwargs)
865 training_futures = [
866 actor.train.remote(rabit_args, params, dtrain, evals, *args, **kwargs)
--> 867 for actor in _training_state.actors if actor is not None
868 ]
869
~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/xgboost_ray/main.py in <listcomp>(.0)
865 training_futures = [
866 actor.train.remote(rabit_args, params, dtrain, evals, *args, **kwargs)
--> 867 for actor in _training_state.actors if actor is not None
868 ]
869
~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/actor.py in remote(self, *args, **kwargs)
107
108 def remote(self, *args, **kwargs):
--> 109 return self._remote(args, kwargs)
110
111 def options(self, **options):
~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/actor.py in _remote(self, args, kwargs, name, num_returns)
148 invocation = self._decorator(invocation)
149
--> 150 return invocation(args, kwargs)
151
152 def __getstate__(self):
~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/actor.py in invocation(args, kwargs)
142 kwargs=kwargs,
143 name=name,
--> 144 num_returns=num_returns)
145
146 # Apply the decorator if there is one.
~/opt/miniconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/actor.py in _actor_method_call(self, method_name, args, kwargs, name, num_returns)
823 object_refs = worker.core_worker.submit_actor_task(
824 self._ray_actor_language, self._ray_actor_id, function_descriptor,
--> 825 list_args, name, num_returns, self._ray_actor_method_cpus)
826
827 if len(object_refs) == 1:
python/ray/_raylet.pyx in ray._raylet.CoreWorker.submit_actor_task()
python/ray/_raylet.pyx in ray._raylet.CoreWorker.submit_actor_task()
python/ray/_raylet.pyx in ray._raylet.prepare_args()
python/ray/_raylet.pyx in ray._raylet.CoreWorker.put_serialized_object()
python/ray/_raylet.pyx in ray._raylet.CoreWorker._create_put_buffer()
python/ray/_raylet.pyx in ray._raylet.check_status()
ObjectStoreFullError: Failed to put object ffffffffffffffffffffffffffffffffffffffff0100000022000000 in object store because it is full. Object size is 2552004917 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
Finally got it to work. I used ray.init(object_store_memory=34359738368) to increase the object store memory. I think the pydoc description of object_store_memory might not be accurate, or at least it is open to misinterpretation:
object_store_memory: The amount of memory (in bytes) to start the object store with. By default, this is automatically set based on available system memory.
For the CSV file, you'll have to install the patch from #83 as that will take care of the KeyError as well.
For the memory error, the data is sharded into 4 shards (one per worker), and each piece seems to be around 2552004917 bytes, so 2.5 GB per shard, or 10 GB total. Can you run ray memory when this error comes up? It would be good to see how large your object store currently is.
By default, the object store memory is initialized to 30% of the available node memory (see here); in your case this should be enough, but you may want to set it a little higher initially.
You can set the object store memory with ray.init(object_store_memory=x) or on the command line with ray start --head --object-store-memory x.
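For illustration, a minimal sketch of setting the object store size explicitly at startup; the 34359738368 value used above is exactly 32 * 1024**3 bytes, i.e. 32 GiB:

import ray

# Reserve an explicit amount of object store memory at startup
# (32 GiB = 32 * 1024**3 = 34359738368 bytes, the value used above).
ray.init(object_store_memory=32 * 1024**3)

# Equivalent when starting the head node from the command line:
#   ray start --head --object-store-memory 34359738368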
Ah great, glad you got it to work!
Thanks very much.
Hi,
I'm trying to reproduce the HIGGS dataset example, but when I start the training I get the following messages, and only 1 out of 4 actors gets assigned any data or training jobs; the one actor that does get some work eventually dies as well. I'm using Python 3.7.10 on macOS, Ray 1.2.0 and xgboost_ray 0.0.4.
Your help would be appreciated. Thanks!