sjtu-marl / malib

A parallel framework for population-based multi-agent reinforcement learning.
https://malib.io
MIT License
498 stars 60 forks

[mappo+gfootball] Failed to run on a ray cluster #31

Closed: Bender1019 closed this issue 2 years ago

Bender1019 commented 2 years ago

I tried to run this branch on a Ray cluster, but got the error messages below:

ray.exceptions.RayTaskError(_InactiveRpcError): ray::RolloutWorker.get_status()
  File "python/ray/_raylet.pyx", line 422, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 422, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 456, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/malib_cls_1206/malib/rollout/rollout_worker.py", line 44, in __init__
    self, worker_index, env_desc, metric_type, remote, save, **kwargs
  File "/home/malib_cls_1206/malib/rollout/base_worker.py", line 102, in __init__
    **kwargs["exp_cfg"],
  File "/home/malib_cls_1206/malib/utils/logger/__init__.py", line 249, in get_logger
    primary=expr_group, secondary=expr_name
  File "/home/malib_cls_1206/malib/rpc/ExperimentManager/ExperimentClient.py", line 73, in create_table
    self._create_table_callback(future.result()[0])
  File "/home/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/home/anaconda3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/anaconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/malib_cls_1206/malib/rpc/ExperimentManager/ExperimentClient.py", line 47, in _create_table
    table_key = stub.CreateTable(table_name, **kwargs)
  File "/home/anaconda3/lib/python3.6/site-packages/grpc/_channel.py", line 826, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/anaconda3/lib/python3.6/site-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1638866326.987521646","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4133,"referenced_errors":[{"created":"@1638866326.987518864","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"
>
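
For anyone reading this trace: `StatusCode.UNAVAILABLE` with "failed to connect to all addresses" means the gRPC client could not open a connection to its target at all. Here that call is `stub.CreateTable` in `ExperimentClient.py`. The thread does not state the root cause, so the following is only a minimal diagnostic sketch, assuming the problem is plain network reachability from a worker node. The host and port are hypothetical placeholders, not values from malib.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    This checks reachability only; it does not speak gRPC.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def report(host, port):
    """Print a one-line reachability report for host:port."""
    state = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
    return state


# Hypothetical usage, run on the node where the rollout worker was scheduled:
# report("experiment-server-host", 12345)
```

If the address is reachable from the head node but not from a worker node, the server is probably bound to, or advertised under, an address that is only valid locally (for example `localhost`).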

The only change I made was at runner.py:61, so that the Ray runtime attaches to a Ray cluster that was started beforehand. I also set num_episodes and the other resource-related parameters to small values.
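
The actual edit at runner.py:61 is not shown in this thread. As a rough illustration of what attaching to an existing cluster usually looks like, here is a sketch assuming the change amounts to passing an `address` to `ray.init()`. The helper names `ray_init_kwargs` and `attach_to_cluster` are made up for this example and are not part of malib.

```python
import os


def ray_init_kwargs(address=None):
    """Return keyword arguments for ray.init().

    With an address (explicit, or taken from the RAY_ADDRESS environment
    variable) ray.init() attaches to an already running cluster. With no
    address it starts a fresh local Ray instance instead.
    """
    address = address or os.environ.get("RAY_ADDRESS")
    return {"address": address} if address else {}


def attach_to_cluster(address="auto"):
    """Attach this driver to an existing Ray cluster (needs ray installed)."""
    import ray  # imported lazily so ray_init_kwargs works without ray

    # address="auto" asks Ray to look for a cluster already running on
    # this machine, e.g. one started earlier with `ray start --head`.
    return ray.init(**ray_init_kwargs(address))
```

Calling `attach_to_cluster()` on the head node, or `attach_to_cluster("<head-ip>:<port>")` from another machine, would then connect the driver to the cluster rather than starting a new local one.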

KornbergFresnel commented 2 years ago

@zyp57783 I fixed this issue in commit db001cc. Please update your local repository and try again.