mars-project / mars

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.
https://mars-project.readthedocs.io
Apache License 2.0

[BUG][Ray] LightGBMError: Machine list file doesn't contain the local machine #3350

Status: Closed (zhongchun closed this issue 1 year ago)

zhongchun commented 1 year ago

Describe the bug A LightGBMError ("Machine list file doesn't contain the local machine") is raised when I run lightgbm.LGBMClassifier.fit on a Mars cluster that runs on Ray.

To Reproduce To help us reproduce this bug, please provide the information below:

  1. Your Python version: python 3.7.9
  2. The version of Mars you use: 0.10.0
  3. Versions of crucial packages, such as numpy, scipy and pandas: numpy 1.21.6, pandas 1.3.5, lightgbm 3.32
  4. Full stack of the error.
  5. Minimized code to reproduce the error.

I launched a Mars cluster on a 4-node Ray cluster, with 1 supervisor and 3 workers. The supervisor occupies one node, and the 3 workers run on 3 different nodes. Breast_cancer_data.csv is from https://www.kaggle.com/code/prashant111/lightgbm-classifier-in-python/input

import pandas as pd
import mars.dataframe as md
from mars.learn.contrib import lightgbm as lgb

# Load the CSV locally, then split it into Mars chunks of 300 rows.
df = pd.read_csv("./Breast_cancer_data.csv")
mdf = md.DataFrame(data=df, chunk_size=300)

X = mdf[['mean_radius','mean_texture','mean_perimeter','mean_area','mean_smoothness']]
y = mdf['diagnosis']

# Training runs on the Mars workers; the error is raised inside fit().
gbm = lgb.LGBMClassifier(importance_type='gain')
gbm.fit(X, y)

The results are as follows:

2023-05-25 19:32:31,136 ERROR threading.py:870 -- Got unhandled error when handling message ('run', 0, (<Subtask id=dKPvo4XoC1caISqsBgPdSr1D results=[LGBMTrain(f48c751592621feca43dcee83cb7e6c8_0)]>,), {}) in actor b'oTwzkmb1xDpbruGF6ienLYTb_subtask_processor' at ray://mars_cluster_1685014327/1/3
Traceback (most recent call last):
  File "mars/oscar/core.pyx", line 519, in mars.oscar.core._BaseActor.__on_receive__
  File "mars/oscar/core.pyx", line 404, in _handle_actor_result
  File "mars/oscar/core.pyx", line 447, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 448, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 453, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 641, in run
    result = yield self._running_aio_task
  File "mars/oscar/core.pyx", line 458, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 378, in _handle_actor_result
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 474, in run
    await self._execute_graph(chunk_graph)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 231, in _execute_graph
    await to_wait
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/lib/aio/_threads.py", line 36, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/usr/local/python3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/core/mode.py", line 77, in _inner
    return func(*args, **kwargs)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 199, in _execute_operand
    raise ExecutionError(ex).with_traceback(ex.__traceback__) from None
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 196, in _execute_operand
    return execute(ctx, op)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/core/operand/core.py", line 491, in execute
    result = executor(results, op)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/learn/contrib/lightgbm/_train.py", line 390, in execute
    **op.kwds,
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/sklearn.py", line 972, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/sklearn.py", line 758, in fit
    callbacks=callbacks
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/engine.py", line 271, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 2602, in __init__
    num_machines=params["num_machines"]
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 2745, in set_network
    ctypes.c_int(num_machines)))
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
mars.core.base.ExecutionError: Machine list file doesn't contain the local machine
2023-05-25 19:32:31,139 ERROR api.py:121 -- Got unhandled error when handling message ('run_subtask', 0, (<Subtask id=dKPvo4XoC1caISqsBgPdSr1D results=[LGBMTrain(f48c751592621feca43dcee83cb7e6c8_0)]>,), {}) in actor b'slot_numa-0_2_subtask_runner' at ray://mars_cluster_1685014327/1/3
Traceback (most recent call last):
  File "mars/oscar/core.pyx", line 519, in mars.oscar.core._BaseActor.__on_receive__
  File "mars/oscar/core.pyx", line 404, in _handle_actor_result
  File "mars/oscar/core.pyx", line 447, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 448, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 453, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/runner.py", line 147, in run_subtask
    result = yield self._running_processor.run(subtask)
  File "mars/oscar/core.pyx", line 458, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 378, in _handle_actor_result
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/oscar/backends/context.py", line 196, in send
    return self._process_result_message(result)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/oscar/backends/context.py", line 76, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/oscar/backends/pool.py", line 677, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/oscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/oscar/api.py", line 121, in __on_receive__
    return await super().__on_receive__(message)
  File "mars/oscar/core.pyx", line 526, in __on_receive__
  File "mars/oscar/core.pyx", line 519, in mars.oscar.core._BaseActor.__on_receive__
  File "mars/oscar/core.pyx", line 404, in _handle_actor_result
  File "mars/oscar/core.pyx", line 447, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 448, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 453, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 641, in run
    result = yield self._running_aio_task
  File "mars/oscar/core.pyx", line 458, in mars.oscar.core._BaseActor._run_actor_async_generator
  File "mars/oscar/core.pyx", line 378, in _handle_actor_result
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 474, in run
    await self._execute_graph(chunk_graph)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 231, in _execute_graph
    await to_wait
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/lib/aio/_threads.py", line 36, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/usr/local/python3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/core/mode.py", line 77, in _inner
    return func(*args, **kwargs)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 199, in _execute_operand
    raise ExecutionError(ex).with_traceback(ex.__traceback__) from None
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/services/subtask/worker/processor.py", line 196, in _execute_operand
    return execute(ctx, op)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/core/operand/core.py", line 491, in execute
    result = executor(results, op)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/mars/learn/contrib/lightgbm/_train.py", line 390, in execute
    **op.kwds,
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/sklearn.py", line 972, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/sklearn.py", line 758, in fit
    callbacks=callbacks
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/engine.py", line 271, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 2602, in __init__
    num_machines=params["num_machines"]
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 2745, in set_network
    ctypes.c_int(num_machines)))
  File "/home/admin/ray-pack/tmp/job/9f040080/pyenv/lib/python3.7/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
mars.core.base.ExecutionError: [address=ray://mars_cluster_1685014327/1/3, pid=400941] Machine list file doesn't contain the local machine
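
As far as I understand, LightGBM raises this error during network setup (the set_network call in the traceback above) when none of the entries in its "machines" list ("ip1:port1,ip2:port2,...") matches both one of the local machine's IP addresses and its local listen port. I have not confirmed what Mars passes as "machines" when running on Ray. The sketch below is not Mars or LightGBM code, only a rough way to check the same condition by hand on a worker. The function names and the addresses in the usage note are hypothetical, and guess_local_ips only approximates the way LightGBM enumerates local addresses.

```python
import socket
from typing import List, Set, Tuple


def parse_machines(machines: str) -> List[Tuple[str, int]]:
    """Parse an 'ip1:port1,ip2:port2' string into (ip, port) pairs.

    Assumes every non-empty entry is well formed; a malformed port
    raises ValueError.
    """
    pairs = []
    for entry in machines.split(","):
        entry = entry.strip()
        if not entry:
            continue
        ip, _, port = entry.rpartition(":")
        pairs.append((ip, int(port)))
    return pairs


def local_machine_in_list(
    machines: str, local_ips: Set[str], local_listen_port: int
) -> bool:
    """Return True if some entry matches a local IP and the local listen port."""
    return any(
        ip in local_ips and port == local_listen_port
        for ip, port in parse_machines(machines)
    )


def guess_local_ips() -> Set[str]:
    """Best-effort set of IPv4 addresses that the local hostname resolves to."""
    try:
        return set(socket.gethostbyname_ex(socket.gethostname())[2])
    except OSError:
        # Hostname could not be resolved; report no known local addresses.
        return set()
```

For example, with a hypothetical list "10.0.0.1:12400,10.0.0.2:12400", calling local_machine_in_list(machines, guess_local_ips(), 12400) on each worker would show whether that worker can find itself in the list.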
