ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. #6

Closed 544211707 closed 2 years ago

544211707 commented 3 years ago

Hello. With last year's Ray-based LMAPF training version, my own trained model could not achieve good path planning with more than 8 agents. With the latest training version, I get the following error about a Ray worker dying. What is the problem, please?

(pid=5131) starting episode 5 on metaAgent 5
(pid=5137) running imitation job
(pid=5131) 2021-04-14 10:22:36.118770: I tensorflow/stream_executor/] successfully opened CUDA library locally
(pid=5138) 2021-04-14 10:22:36.543190: I tensorflow/stream_executor/] successfully opened CUDA library locally
(pid=5137) cannot allocate memory for thread-local data: ABORT
E0414 10:22:37.032392  5052  5191] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=15c675b22d037e3bf66d17ba0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=f66d17ba0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2021-04-14 10:22:37,038 WARNING -- A worker died or was killed while executing task fffffffffffffffff66d17ba0100.
Traceback (most recent call last):
  File "/XX/PRIMAL2-main-re/", line 232, in <module>
  File "/XX/PRIMAL2-main-re/", line 173, in main
    jobResults, metrics, info = ray.get(done_id)[0]
  File "/XX/anaconda3/envs/p2/lib/python3.6/site-packages/ray/", line 1540, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
E0414 10:22:37.053786  5052  5191] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=9d28cb176c7f7501ef0a6c220100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=ef0a6c220100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2021-04-14 10:22:37,059 WARNING -- A worker died or was killed while executing task ffffffffffffffffef0a6c220100.
544211707 commented 3 years ago

I found a possible cause, related to the line max_time += time_limit in Env_Builder. When I delete this line to match the latest version, the above error appears. So is this line needed? And is the other change, time_limit = time_limit - c_time, also needed? I don't quite understand~

fire-keeper commented 3 years ago

@544211707 Hi, I have run into the same problem as you. Have you managed to deal with it? Do you have any ideas for solving it?

544211707 commented 3 years ago


I revised it like this and it works, but I don't know whether deleting the line is right.

fire-keeper commented 3 years ago

@544211707 I think I have found the root of the problem. It is cpp_mstar, the compiled Python wrapper of od_mstar, which consumes too much memory. In "max_time += time_limit", max_time has not been defined, so an exception is raised, which makes the program call od_mstar instead of cpp_mstar. So the heavy memory consumption is avoided by this odd line, "max_time += time_limit".
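
A minimal sketch of the mechanism described above, not the actual PRIMAL2 code. It assumes the planner call sits inside a try block with a broad except handler that falls back to od_mstar. The functions cpp_mstar_stub, od_mstar_stub, plan_with_stray_line and plan_without_stray_line are hypothetical stand-ins invented for illustration.

```python
def cpp_mstar_stub(world, starts, goals, time_limit):
    """Hypothetical stand-in for the compiled cpp_mstar planner."""
    return ("cpp_mstar", time_limit)


def od_mstar_stub(world, starts, goals, time_limit):
    """Hypothetical stand-in for the pure-Python od_mstar planner."""
    return ("od_mstar", time_limit)


def plan_with_stray_line(world, starts, goals, time_limit=5.0):
    try:
        # max_time is never assigned before this augmented assignment, so
        # Python raises UnboundLocalError (a subclass of NameError) here and
        # the cpp_mstar call on the next line is never reached.
        max_time += time_limit
        return cpp_mstar_stub(world, starts, goals, time_limit)
    except Exception:
        # The broad handler swallows the error and falls back silently.
        return od_mstar_stub(world, starts, goals, time_limit)


def plan_without_stray_line(world, starts, goals, time_limit=5.0):
    try:
        # With the stray line removed, the compiled planner is actually called.
        return cpp_mstar_stub(world, starts, goals, time_limit)
    except Exception:
        return od_mstar_stub(world, starts, goals, time_limit)


if __name__ == "__main__":
    print(plan_with_stray_line(None, [], [])[0])
    print(plan_without_stray_line(None, [], [])[0])
```

If the real code is structured this way, deleting the line would route every call to cpp_mstar, which would fit the memory abort appearing only after the deletion.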

544211707 commented 3 years ago

@fire-keeper OK, I got it~

Qiutianyun456 commented 3 years ago

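@544211707 I think I have found the root of the problem. It is cpp_mstar, the compiled Python wrapper of od_mstar, which consumes too much memory. In "max_time += time_limit", max_time has not been defined, so an exception is raised, which makes the program call od_mstar instead of cpp_mstar. So the heavy memory consumption is avoided by this odd line, "max_time += time_limit". Hi, why can't I find this line of code, "max_time += time_limit", in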

greipicon commented 9 months ago

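@544211707 I think I have found the root of the problem. It is cpp_mstar, the compiled Python wrapper of od_mstar, which consumes too much memory. In "max_time += time_limit", max_time has not been defined, so an exception is raised, which makes the program call od_mstar instead of cpp_mstar. So the heavy memory consumption is avoided by this odd line, "max_time += time_limit". Hi, why can't I find this line of code, "max_time += time_limit", in Env_Builder.py? Hi, I'm getting the same error and also can't find this line of code, "max_time += time_limit", in, have you solved this problem yet?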