marmotlab / PRIMAL2

Training code PRIMAL2 - Public Repo
MIT License

ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. #6

Closed · 544211707 closed this issue 2 years ago

544211707 commented 3 years ago

Hello, with last year's Ray-based LMAPF training version, my own trained model could not achieve good path planning with more than 8 agents. With the latest training version, I get an error about a Ray worker dying, shown below. What could the problem be, please?

(pid=5131) starting episode 5 on metaAgent 5
(pid=5137) running imitation job
(pid=5131) 2021-04-14 10:22:36.118770: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
(pid=5138) 2021-04-14 10:22:36.543190: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
(pid=5135) terminate called after throwing an instance of 'std::bad_alloc'
(pid=5135)   what():  std::bad_alloc
(pid=5135) *** Aborted at 1618366956 (unix time) try "date -d @1618366956" if you are using GNU date ***
(pid=5135) PC: @                0x0 (unknown)
(pid=5135) *** SIGABRT (@0x3e80000140f) received by PID 5135 (TID 0x7f1c9df65700) from PID 5135; stack trace: ***
(pid=5135)     @     0x7f1c9db7b390 (unknown)
(pid=5135)     @     0x7f1c9d7d5438 gsignal
(pid=5135)     @     0x7f1c9d7d703a abort
(pid=5135)     @     0x7f1c9745f84a __gnu_cxx::__verbose_terminate_handler()
(pid=5135)     @     0x7f1c9745df47 __cxxabiv1::__terminate()
(pid=5135)     @     0x7f1c9745df7d std::terminate()
(pid=5135)     @     0x7f1c9745e15a __cxa_throw
(pid=5135)     @     0x7f1c9745e522 operator new()
(pid=5135)     @     0x7f1c974ad68c std::__cxx11::basic_string<>::_M_construct()
(pid=5135)     @     0x7f1b4f654a09 tensorflow::SerializeToStringDeterministic()
(pid=5135)     @     0x7f1b4f2ed5ab tensorflow::(anonymous namespace)::TensorProtoHash()
(pid=5135)     @     0x7f1b4f2ed6d8 tensorflow::(anonymous namespace)::FastTensorProtoHash()
(pid=5135)     @     0x7f1b4f2e9a33 tensorflow::(anonymous namespace)::AttrValueHash()
(pid=5135)     @     0x7f1b4f2e9e27 tensorflow::FastAttrValueHash()
(pid=5135)     @     0x7f1b110208f7 tensorflow::grappler::UniqueNodes::ComputeSignature()
(pid=5135)     @     0x7f1b11023160 tensorflow::grappler::ArithmeticOptimizer::DedupComputations()
(pid=5135)     @     0x7f1b1103e522 tensorflow::grappler::ArithmeticOptimizer::Optimize()
(pid=5135)     @     0x7f1b1100ea30 tensorflow::grappler::MetaOptimizer::RunOptimizer()
(pid=5135)     @     0x7f1b1100f969 tensorflow::grappler::MetaOptimizer::OptimizeGraph()
(pid=5135)     @     0x7f1b11010e5d tensorflow::grappler::MetaOptimizer::Optimize()
(pid=5135)     @     0x7f1b11013b77 tensorflow::grappler::RunMetaOptimizer()
(pid=5135)     @     0x7f1b11005afc tensorflow::GraphExecutionState::OptimizeGraph()
(pid=5135)     @     0x7f1b1100742a tensorflow::GraphExecutionState::BuildGraph()
(pid=5135)     @     0x7f1b0e312549 tensorflow::DirectSession::CreateGraphs()
(pid=5135)     @     0x7f1b0e313ea5 tensorflow::DirectSession::CreateExecutors()
(pid=5135)     @     0x7f1b0e316120 tensorflow::DirectSession::GetOrCreateExecutors()
(pid=5135)     @     0x7f1b0e31788f tensorflow::DirectSession::Run()
(pid=5135)     @     0x7f1b0bb95251 tensorflow::SessionRef::Run()
(pid=5135)     @     0x7f1b0bd8dd41 TF_Run_Helper()
(pid=5135)     @     0x7f1b0bd8e53e TF_SessionRun
(pid=5135)     @     0x7f1b0bb90dc9 tensorflow::TF_SessionRun_wrapper_helper()
(pid=5135)     @     0x7f1b0bb90e62 tensorflow::TF_SessionRun_wrapper()
(pid=5137) cannot allocate memory for thread-local data: ABORT
E0414 10:22:37.032392  5052  5191 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=15c675b22d037e3bf66d17ba0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=f66d17ba0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2021-04-14 10:22:37,038 WARNING worker.py:1134 -- A worker died or was killed while executing task fffffffffffffffff66d17ba0100.
Traceback (most recent call last):
  File "/XX/PRIMAL2-main-re/driver.py", line 232, in <module>
    main()
  File "/XX/PRIMAL2-main-re/driver.py", line 173, in main
    jobResults, metrics, info = ray.get(done_id)[0]
  File "/XX/anaconda3/envs/p2/lib/python3.6/site-packages/ray/worker.py", line 1540, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
E0414 10:22:37.053786  5052  5191 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=Runner, class_name=imitationRunner, function_name=job, function_hash=}, task_id=9d28cb176c7f7501ef0a6c220100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=ef0a6c220100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
2021-04-14 10:22:37,059 WARNING worker.py:1134 -- A worker died or was killed while executing task ffffffffffffffffef0a6c220100.
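
(Note on the traceback: the RayActorError raised at ray.get in driver.py is only the driver-side symptom; the actual failure is the std::bad_alloc / SIGABRT in the worker process above. A minimal, self-contained sketch of how this class of error surfaces; the class and method names are illustrative stand-ins, not the repository's actual Runner.imitationRunner code.)

    import ray

    @ray.remote
    class ImitationRunner:          # stand-in for Runner.imitationRunner
        def job(self):
            # simulate the worker process aborting mid-task (in the logs it is
            # a std::bad_alloc inside TensorFlow that kills the process)
            import os
            os._exit(1)

    ray.init()
    runner = ImitationRunner.remote()
    done_id = runner.job.remote()
    try:
        ray.get(done_id)
    except ray.exceptions.RayActorError:
        # the same exception the driver.py traceback above ends with
        print("The actor died unexpectedly before finishing this task.")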
544211707 commented 3 years ago

I found a possible reason, related to the line max_time += time_limit in Env_Builder. When I delete this line of code to match the latest version, the error above appears. So is this line of code needed? And is the other change, time_limit = time_limit - c_time, also needed? I don't particularly understand this~

fire-keeper commented 3 years ago

@544211707 Hi, I have run into the same problem as you. I wonder if you have dealt with it. Do you have any idea how to solve it?

544211707 commented 3 years ago

@fire-keeper

I found a possible reason, related to the line max_time += time_limit in Env_Builder. When I delete this line of code to match the latest version, the error above appears. So is this line of code needed? And is the other change, time_limit = time_limit - c_time, also needed? I don't particularly understand this~

I revised it like this; it works, but I don't know whether deleting it is right.
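
(Purely as an illustration, since the comment does not show the actual edit: the two variants being discussed might look roughly like the fragment below. The name c_time is taken only from the discussion above and may not match the real Env_Builder.py.)

    # Hypothetical sketch of the two variants discussed above (not the real code)
    def remaining_time_limit(time_limit, c_time):
        # the older version reportedly had, at this spot:
        #     max_time += time_limit   # fails: max_time was never defined
        # the alternative change asked about subtracts the time already spent:
        return time_limit - c_time

    print(remaining_time_limit(time_limit=60.0, c_time=12.5))   # 47.5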

fire-keeper commented 3 years ago

@544211707 I think I have found the essence of the problem. It is cpp_mstar, the compiled Python wrapper of od_mstar, which consumes too much memory. In "max_time += time_limit", max_time has never been defined, so an exception is raised, which means the program never calls cpp_mstar and falls back to od_mstar. Therefore, the huge memory consumption problem is worked around by this weird line "max_time += time_limit".
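
(A runnable toy demo of the control flow described here, assuming the expert call in Env_Builder sits inside a try/except as in the original PRIMAL-style code; the two stub functions merely stand in for the real cpp_mstar / od_mstar planners and are not the repository's code.)

    def cpp_planner():
        print("C++ M* planner (memory-hungry cpp_mstar) called")

    def python_planner():
        print("pure-Python od_mstar planner called instead")

    def expert_path(time_limit):
        try:
            # max_time is never defined, so this raises UnboundLocalError
            # before the C++ planner is ever reached
            max_time += time_limit
            cpp_planner()
        except Exception:
            # the fallback branch always runs, so the memory blow-up
            # (the std::bad_alloc / RayActorError in the logs above) never happens
            python_planner()

    expert_path(60.0)    # prints: pure-Python od_mstar planner called instead

If this reading is right, it would also explain the first comment: once the line is removed, cpp_mstar really does get called and the memory problem comes back.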

544211707 commented 3 years ago

@fire-keeper OK, I got it~

Qiutianyun456 commented 3 years ago

@544211707 "I think I have found the essence of the problem. It is cpp_mstar, the compiled Python wrapper of od_mstar, which consumes too much memory. In max_time += time_limit, max_time is not defined, so an exception is raised, which makes the program call od_mstar instead of cpp_mstar. Therefore, the huge memory consumption problem is solved by this strange line max_time += time_limit." Hi, why can't I find this line of code, max_time += time_limit, in Env_Builder.py?

greipicon commented 9 months ago

@544211707 "I think I have found the essence of the problem. It is cpp_mstar, the compiled Python wrapper of od_mstar, which consumes too much memory. In max_time += time_limit, max_time is not defined, so an exception is raised, which keeps the program from calling cpp_mstar and makes it use od_mstar instead. Therefore, the huge memory consumption problem is solved by this strange line max_time += time_limit. Hi, why can't I find this line of code, max_time += time_limit, in Env_Builder.py?" Hi, I'm getting the same error and also can't find this "max_time += time_limit" line of code in Env_Builder.py. Have you solved this problem yet?