sjtu-marl / malib

A parallel framework for population-based multi-agent reinforcement learning.
https://malib.io
MIT License
498 stars, 60 forks

thread: Resource temporarily unavailable #33

Closed by Bender1019 2 years ago

Bender1019 commented 2 years ago

The Ray cluster crashes when `num_episodes` is set to 64 or higher.

`[2021-12-13 14:49:20,867][INFO] registered request handler=optimization
[2021-12-13 14:49:20,867][INFO] registered request handler=simulation
[2021-12-13 14:49:20,867][INFO] registered request handler=evaluate
[2021-12-13 14:49:20,867][INFO] registered request handler=update_payofftable
[2021-12-13 14:49:20,867][INFO] registered request handler=rollout
[2021-12-13 14:49:20,870][INFO] Pre launch checking for Coordinator server ...
<function _request_simulation at 0x7fee8480d0d0>
2021-12-13 14:49:20,873 INFO worker.py:657 -- Connecting to existing Ray cluster at address:
(pid=219, 168) [2021-12-13 14:49:31,968][INFO] dataset server initialized with (table_capacity=256 table_learning_start=64)
(pid=220, 188) WARNING:root:Cannot import alpharank utils, if you wanna run meta game experiments, please install open_spiel before that.
(pid=220, 188) [2021-12-13 14:49:32,388][INFO] registered request handler=optimization
(pid=220, 188) [2021-12-13 14:49:32,389][INFO] registered request handler=simulation
(pid=220, 188) [2021-12-13 14:49:32,389][INFO] registered request handler=evaluate
(pid=220, 188) [2021-12-13 14:49:32,389][INFO] registered request handler=update_payofftable
(pid=220, 188) [2021-12-13 14:49:32,389][INFO] registered request handler=rollout
(pid=380) [2021-12-13 14:49:35,107][INFO] ray.get_gpu_ids(): [7]
(pid=380) [2021-12-13 14:49:35,108][INFO] CUDA_VISIBLE_DEVICES: 7
(pid=220, 188) [2021-12-13 14:49:35,365][INFO] training manager launched, 1 learner(s) created
(pid=220, 188) [2021-12-13 14:49:35,366][INFO] set worker num as 1
(pid=220, 188) [2021-12-13 14:49:35,373][INFO] RolloutWorker manager launched, 1 rollout worker(s) alives.
(pid=220, 188) [2021-12-13 14:49:35,374][INFO] use_init_policy_pool: False
(pid=380) WARNING:root:Cannot import alpharank utils, if you wanna run meta game experiments, please install open_spiel before that.
(pid=380) [2021-12-13 14:49:35,344][INFO] registered request handler=optimization
(pid=380) [2021-12-13 14:49:35,344][INFO] registered request handler=simulation
(pid=380) [2021-12-13 14:49:35,344][INFO] registered request handler=evaluate
(pid=380) [2021-12-13 14:49:35,344][INFO] registered request handler=update_payofftable
(pid=380) [2021-12-13 14:49:35,344][INFO] registered request handler=rollout
(pid=508) WARNING:root:Cannot import alpharank utils, if you wanna run meta game experiments, please install open_spiel before that.
(pid=508) [2021-12-13 14:49:37,428][INFO] registered request handler=optimization
(pid=508) [2021-12-13 14:49:37,428][INFO] registered request handler=simulation
(pid=508) [2021-12-13 14:49:37,428][INFO] registered request handler=evaluate
(pid=508) [2021-12-13 14:49:37,428][INFO] registered request handler=update_payofftable
(pid=508) [2021-12-13 14:49:37,428][INFO] registered request handler=rollout
(pid=220, 188) [2021-12-13 14:49:39,592][INFO] Coordinator server started
(pid=220, 188) [2021-12-13 14:49:39,635][INFO] request: TaskType.OPTIMIZE
(pid=220, 188) [2021-12-13 14:49:39,636][INFO] request: TaskType.ROLLOUT
(pid=219, 168) [2021-12-13 14:49:39,726][INFO] created data table: PSGFootball_team_0_MAPPO_0
(pid=219, 168) terminate called after throwing an instance of 'boost::wrapexcept'
(pid=219, 168) what(): thread: Resource temporarily unavailable
2021-12-13 14:51:18,750 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=259, 94)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:19,750 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=226, 94)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:19,750 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=233, 94)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:19,750 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=227, 94)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:19,750 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=256, 94)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:19,750 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=263, 94)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:20,751 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=240, 94)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:24,751 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=224, 120)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:24,751 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=277, 120)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:24,751 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=228, 120)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:24,751 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=219, 120)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:25,751 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=267, 120)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:25,752 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=222, 120)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:25,752 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=288, 120)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:25,752 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=225, 120)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:26,752 ERROR worker.py:980 -- Possible unhandled error from worker: ray::Stepping.run() (pid=226, 120)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/////malib/utils/logger/__init__.py", line 136, in wrapper
    return func(*args, **kwargs)
  File "/home/////malib/rollout/rollout_func.py", line 431, in run
    dataset_server=self._dataset_server if task_type == "rollout" else None,
  File "/home/////malib/rollout/rollout_func.py", line 291, in env_runner
    batch = ray.get(dataset_server.get_producer_index.remote(buffer_desc))
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-12-13 14:51:37,933 WARNING worker.py:1034 -- The node with node id 81c5e01345f7d92b30121df0b3af788325462cb9 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.

`
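
A note on the `thread: Resource temporarily unavailable` message in the log: that text is the standard description of `EAGAIN`, which thread creation commonly returns when the per-user process/thread limit is exhausted. Whether that is the cause of this crash is an assumption; the log does not confirm it. A minimal sketch for inspecting the limit from the environment that launches Ray (Unix only; `describe_limit` and `thread_limit_report` are hypothetical helper names, not part of MALib or Ray):

```python
import resource
import threading


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def thread_limit_report():
    """Return the soft/hard RLIMIT_NPROC values and this process's Python thread count."""
    # RLIMIT_NPROC caps the number of processes (and, on Linux, threads) per user.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "nproc_soft": describe_limit(soft),
        "nproc_hard": describe_limit(hard),
        # Only counts threads started through Python's threading module,
        # not native threads created by C++ extensions.
        "python_threads": threading.active_count(),
    }


if __name__ == "__main__":
    for key, value in thread_limit_report().items():
        print(f"{key}: {value}")
```

If the soft limit turns out to be low relative to the number of rollout workers spawned for a given `num_episodes`, raising it before starting the cluster would be one thing to try.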