**Describe the bug**
Running `lightgbm.LGBMClassifier.fit` on a Mars cluster deployed on Ray raises `LightGBMError: Machine list file doesn't contain the local machine`.
**To Reproduce**
To help us reproduce this bug, please provide information below:
* Your Python version: Python 3.7.9
* The version of Mars you use: 0.10.0
* Versions of crucial packages, such as numpy, scipy and pandas: numpy 1.21.6, pandas 1.3.5, lightgbm 3.32
```python
import pandas as pd
import mars.dataframe as md

df = pd.read_csv("./Breast_cancer_data.csv")
mdf = md.DataFrame(data=df, chunk_size=300)
X = mdf[['mean_radius', 'mean_texture', 'mean_perimeter', 'mean_area', 'mean_smoothness']]
y = mdf['diagnosis']

from mars.learn.contrib import lightgbm as lgb

gbm = lgb.LGBMClassifier(importance_type='gain')
gbm.fit(X, y)
```
The results are as follows:

```
2023-05-25 19:32:31,136 ERROR threading.py:870 -- Got unhandled error when handling message ('run', 0, (<Subtask id=dKPvo4XoC1caISqsBgPdSr1D results=[LGBMTrain(f48c751592621feca43dcee83cb7e6c8_0)]>,), {}) in actor b'oTwzkmb1xDpbruGF6ienLYTb_subtask_processor' at ray://mars_cluster_1685014327/1/3
Traceback (most recent call last):
  File "mars/oscar/core.pyx", line 519, in mars.oscar.core._BaseActor.__on_receive__
  File "mars/oscar/core.pyx", line 404, in _handle_actor_result
  File "mars/oscar/core.pyx", line 447, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 448, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 453, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 641, in run
    result = yield self._running_aio_task
  File "mars/oscar/core.pyx", line 458, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 378, in _handle_actor_result
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 474, in run
    await self._execute_graph(chunk_graph)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 231, in _execute_graph
    await to_wait
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/lib/aio/_threads.py", line 36, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/usr/local/python3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/core/mode.py", line 77, in _inner
    return func(*args, **kwargs)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 199, in _execute_operand
    raise ExecutionError(ex).with_traceback(ex.__traceback__) from None
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 196, in _execute_operand
    return execute(ctx, op)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/core/operand/core.py", line 491, in execute
    result = executor(results, op)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/learn/contrib/lightgbm/_train.py", line 390, in execute
    **op.kwds,
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/sklearn.py", line 972, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/sklearn.py", line 758, in fit
    callbacks=callbacks
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/engine.py", line 271, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 2602, in __init__
    num_machines=params["num_machines"]
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 2745, in set_network
    ctypes.c_int(num_machines)))
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
mars.core.base.ExecutionError: Machine list file doesn't contain the local machine
```
```
2023-05-25 19:32:31,139 ERROR api.py:121 -- Got unhandled error when handling message ('run_subtask', 0, (<Subtask id=dKPvo4XoC1caISqsBgPdSr1D results=[LGBMTrain(f48c751592621feca43dcee83cb7e6c8_0)]>,), {}) in actor b'slot_numa-0_2_subtask_runner' at ray://mars_cluster_1685014327/1/3
Traceback (most recent call last):
  File "mars/oscar/core.pyx", line 519, in mars.oscar.core._BaseActor.__on_receive__
  File "mars/oscar/core.pyx", line 404, in _handle_actor_result
  File "mars/oscar/core.pyx", line 447, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 448, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 453, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/runner.py", line 147, in run_subtask
    result = yield self._running_processor.run(subtask)
  File "mars/oscar/core.pyx", line 458, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 378, in _handle_actor_result
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/oscar/backends/context.py", line 196, in send
    return self._process_result_message(result)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/oscar/backends/context.py", line 76, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/oscar/backends/pool.py", line 677, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/oscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/oscar/api.py", line 121, in __on_receive__
    return await super().__on_receive__(message)
  File "mars/oscar/core.pyx", line 526, in __on_receive__
  File "mars/oscar/core.pyx", line 519, in mars.oscar.core._BaseActor.__on_receive__
  File "mars/oscar/core.pyx", line 404, in _handle_actor_result
  File "mars/oscar/core.pyx", line 447, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 448, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 453, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 641, in run
    result = yield self._running_aio_task
  File "mars/oscar/core.pyx", line 458, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 378, in _handle_actor_result
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 474, in run
    await self._execute_graph(chunk_graph)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 231, in _execute_graph
    await to_wait
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/lib/aio/_threads.py", line 36, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/usr/local/python3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/core/mode.py", line 77, in _inner
    return func(*args, **kwargs)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 199, in _execute_operand
    raise ExecutionError(ex).with_traceback(ex.__traceback__) from None
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 196, in _execute_operand
    return execute(ctx, op)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/core/operand/core.py", line 491, in execute
    result = executor(results, op)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/learn/contrib/lightgbm/_train.py", line 390, in execute
    **op.kwds,
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/sklearn.py", line 972, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/sklearn.py", line 758, in fit
    callbacks=callbacks
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/engine.py", line 271, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 2602, in __init__
    num_machines=params["num_machines"]
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 2745, in set_network
    ctypes.c_int(num_machines)))
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
mars.core.base.ExecutionError: [address=ray://mars_cluster_1685014327/1/3, pid=400941] Machine list file doesn't contain the local machine
```
**Expected behavior**
`gbm.fit(X, y)` should complete successfully and return a fitted `LGBMClassifier` instead of raising `LightGBMError`.

**Additional context**
I launched a Mars cluster on a 4-node Ray cluster with 1 supervisor and 3 workers: the supervisor occupies one node, and the 3 workers run on the 3 other nodes. `Breast_cancer_data.csv` is from https://www.kaggle.com/code/prashant111/lightgbm-classifier-in-python/input
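For reference, my understanding (an assumption based on LightGBM's documented distributed-training parameters, not on Mars internals) is that during `set_network` each worker checks that its own `ip:port` pair appears in the comma-separated `machines` parameter, and raises "Machine list file doesn't contain the local machine" when the IP it resolves for itself is missing from that list. A minimal diagnostic sketch of that check, with hypothetical addresses:

```python
import socket

def machine_list_contains(machines: str, ip: str, port: int) -> bool:
    """Return True if `ip:port` appears in a LightGBM `machines` string
    (comma-separated `ip:port` entries, one per worker)."""
    entries = {tuple(entry.rsplit(":", 1)) for entry in machines.split(",") if entry}
    return (ip, str(port)) in entries

# Hypothetical machine list for a 3-worker cluster; in the failing run,
# Mars builds the real list from the workers' bind addresses.
machines = "10.0.0.1:13000,10.0.0.2:13000,10.0.0.3:13000"

# Resolve this host's IP the same naive way many launchers do; if the
# resolved IP is absent from the list, LightGBM's network setup fails.
try:
    local_ip = socket.gethostbyname(socket.gethostname())
    print(machine_list_contains(machines, local_ip, 13000))
except socket.gaierror:
    pass  # hostname not resolvable in this environment
```

This suggests the mismatch may come from how each Ray node resolves its own address versus the addresses Mars puts into the machine list.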