kengz / SLM-Lab

Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".
https://slm-lab.gitbook.io/slm-lab/
MIT License
1.25k stars 264 forks

Why do I get "terminating"? #380

Closed lidongke closed 5 years ago

lidongke commented 5 years ago

Hi!

I get "terminating" when I train in search mode and connect to the env via grpc. The log looks like this: "(pid=2023) terminating", and there are no other log lines about this "terminating". My process is also killed at the same time. Why do I get that? @kengz @lgraesser

kengz commented 5 years ago

Hey, can you copy the log from your terminal here and fence it as described at https://help.github.com/en/articles/creating-and-highlighting-code-blocks so it formats properly too? Thanks.

lidongke commented 5 years ago
(pid=2032) [2019-07-08 10:50:40,907 PID:2098 INFO __init__.py log_summary] Trial 0 session 0 dqn-cytraffic_t0_s0 [train_df] epi: 0  t: 57000  wall_t: 1973  opt_step: 911520  frame: 57000  fps: 28.89  total_reward: 141.329  total_reward_ma: 148.981  loss: 4.75619  lr: 1.08201e-11  explore_var: 0.56637  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=2023) [2019-07-08 10:50:41,502 PID:2105 INFO __init__.py log_summary] Trial 1 session 0 dqn-cytraffic_t1_s0 [train_df] epi: 0  t: 46000  wall_t: 1973  opt_step: 735520  frame: 46000  fps: 23.3147  total_reward: 76.3616  total_reward_ma: 72.5925  loss: 10.8611  lr: 3.72085e-10  explore_var: 0.57286  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=2023) `terminating`

This is part of the log. In my tests, if I start up too many envs for training (such as num_envs = 20), this problem seems to occur more frequently. I don't know whether the two are related. @kengz

kengz commented 5 years ago

There's not enough information in the log to tell what's happening. Is your CPU or RAM maxed out? If so, the process might crash. From what you said, the environments you spawn seem to consume all your resources.

lidongke commented 5 years ago

I tested with a small number of envs, like 1 env and 2 trials, and the problem still occurs. My CPU and RAM are not maxed out either. Can you tell me where the source code prints "terminating", and in what situation "terminating" gets printed? Could there be some problem with using grpc? Thanks.

kengz commented 5 years ago

The "terminating" message does not come from SLM Lab for sure. I scanned the source code for Ray (the search module) and it does not seem to come from there either. My guess is that this has to do with grpc, probably from running in multiple parallel processes. A few questions to help debug this situation:

  1. Are the ports assigned automatically so they don't conflict?
  2. There is a grpc instance per trial, right?
  3. Even when you use multiple envs there should still be one grpc instance, right?
  4. A few tests you can try: run search with each of the following and check if the problem still appears:
    • a) only 1 trial, but multiple environments, say 5 or even 20
    • b) only 1 trial, and only 1 environment
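
On question 1, one common way to avoid port conflicts across parallel trials is to let the OS pick the ports. Below is a minimal sketch using only the Python standard library; `assign_ports` is a hypothetical helper, not part of SLM Lab, Ray, or grpc, and it assumes each trial's grpc env can be told which port to use. Note that a port returned here can still be grabbed by another process before the env binds it, so this narrows the window for conflicts rather than eliminating it.

```python
import socket


def assign_ports(num_trials, host="127.0.0.1"):
    """Hypothetical helper: ask the OS for `num_trials` distinct free TCP ports.

    Binding to port 0 lets the OS choose an unused port. All sockets are kept
    open until every port has been picked, so the same port is never returned
    twice within one call.
    """
    socks = []
    try:
        for _ in range(num_trials):
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.bind((host, 0))  # port 0 means "pick any free port"
            socks.append(sock)
        return [sock.getsockname()[1] for sock in socks]
    finally:
        # Release the ports so the real servers can bind them afterwards.
        for sock in socks:
            sock.close()


if __name__ == "__main__":
    # One port per trial, e.g. for 2 trials.
    print(assign_ports(2))
```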

lidongke commented 5 years ago

This error was caused by my port already being in use, and "terminating" was printed by the resulting crash. You can close this issue, thanks! @kengz
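
For anyone hitting the same symptom, a quick check like the one below can confirm whether a port is already taken before a trial starts. This is only a sketch with a hypothetical helper name (`port_in_use`), written against the Python standard library; it is not how SLM Lab or grpc check ports, and the port number in the usage line is made up for illustration.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Hypothetical helper: return True if (host, port) cannot be bound."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding failed, most commonly because the address is already in use.
            return True
        return False


if __name__ == "__main__":
    port = 50051  # illustrative port number, not taken from this thread
    if port_in_use(port):
        print(f"port {port} is already in use, pick another one")
    else:
        print(f"port {port} looks free")
```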