huawei-noah / SMARTS

Scalable Multi-Agent RL Training School for Autonomous Driving
MIT License

ray.memory_monitor.RayOutOfMemoryError #990

Open Meta-YZ opened 3 years ago

Meta-YZ commented 3 years ago

BUG REPORT

High Level Description
Hi! Why does the latest version still have this bug?

SMARTS version: 0.4.17

Error logs and screenshots
[screenshot attached]

LeoLuo0320 commented 3 years ago

I have the same error here when running the command "python examples/rllib.py scenarios/loop".

RutvikGupta commented 3 years ago

Hi @Yuanzhuo-Liu @Leoluo0320, I ran the command python3.7 examples/rllib.py scenarios/loop on a remote server (Ubuntu 16.04) and could not reproduce the error mentioned above. I am able to train and get output for different PIDs (meaning it works without any issue for me). Did you try setting up your environment in venv instead of conda?

@Yuanzhuo-Liu, can you please mention what command you ran to get that result?

Meta-YZ commented 3 years ago

> Hi @Yuanzhuo-Liu @Leoluo0320, I ran the command python3.7 examples/rllib.py scenarios/loop on a remote server (Ubuntu 16.04) and could not reproduce the error mentioned above. I am able to train and get output for different PIDs (meaning it works without any issue for me). Did you try setting up your environment in venv instead of conda?
>
> @Yuanzhuo-Liu, can you please mention what command you ran to get that result?

Thanks! I think the reasons you did not reproduce my error are as follows:

  • Your example episode count is shorter. The same error occurred when I ran 'python ultra/rllib_train.py --task 1 --level easy --policy ppo' and '/home/meta/SMARTS/baselines/marl_benchmark/agents/ppo/baseline-lane-control.yaml'. The common feature is that the default number of episodes is very long; for example, with 'python ultra/rllib_train.py --task 1 --level easy --policy ppo', memory fills up at around episode 500K.

So, I recommend running a program with a longer episode count on an Ubuntu 18.04 machine with less RAM to reproduce it. [screenshot attached]
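For reference, the failure mode reported here can be sketched without Ray at all: a memory monitor periodically samples the process's usage and aborts once it crosses a threshold fraction of the budget. The sketch below is hypothetical and standard-library only; OutOfMemoryError, check_memory, and the 95% threshold are illustrative stand-ins, not Ray's actual internals.

```python
# Hypothetical sketch (not Ray's actual source) of how a monitor like
# ray.memory_monitor can raise an out-of-memory error: sample process
# memory and abort once usage crosses a threshold fraction of a budget.
import resource


class OutOfMemoryError(RuntimeError):
    """Raised when the monitored process exceeds its memory budget."""


def check_memory(budget_bytes, threshold=0.95):
    """Raise OutOfMemoryError if peak RSS exceeds threshold * budget_bytes.

    ru_maxrss is reported in kilobytes on Linux.
    """
    used_bytes = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
    if used_bytes > threshold * budget_bytes:
        raise OutOfMemoryError(
            f"used {used_bytes / 1e9:.2f} GB of a "
            f"{budget_bytes / 1e9:.2f} GB budget"
        )
    return used_bytes


# A 32 GB budget, as on the reporter's machine, is not tripped by a
# small script; a tiny budget is tripped immediately.
print(check_memory(32 * 1024**3) > 0)
```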

RutvikGupta commented 3 years ago

@JenishPatel99 @christianjans, can you please have a look at the ultra/rllib_train.py memory issue?

LeoLuo0320 commented 3 years ago

Hi @RutvikGupta, I tried the same command "python examples/rllib.py scenarios/loop" but changed the policy from "PG" to "PPO". It finished the training process successfully and did not have the memory issue. I wonder if that gives you some ideas about the problem.

Meta-YZ commented 3 years ago

> Hi @Yuanzhuo-Liu @Leoluo0320, I ran the command python3.7 examples/rllib.py scenarios/loop on a remote server (Ubuntu 16.04) and could not reproduce the error mentioned above. I am able to train and get output for different PIDs (meaning it works without any issue for me). Did you try setting up your environment in venv instead of conda?
>
> @Yuanzhuo-Liu, can you please mention what command you ran to get that result?

Hi @RutvikGupta, I tried venv and still have this problem. I really don't know what to do; this memory problem has been bothering me for a long time. Thanks.

Meta-YZ commented 3 years ago

Hi @RutvikGupta @JenishPatel99 @christianjans, sorry to keep troubling you. Is there any progress on my problem? Thanks!

RutvikGupta commented 3 years ago

> Hi @RutvikGupta @JenishPatel99 @christianjans, sorry to keep troubling you. Is there any progress on my problem? Thanks!

Hi @Yuanzhuo-Liu, I haven't been able to reproduce the problem on the server, but I will try to do it on my local machine.

christianjans commented 3 years ago
> • Your example episode count is shorter. The same error occurred when I ran 'python ultra/rllib_train.py --task 1 --level easy --policy ppo' and '/home/meta/SMARTS/baselines/marl_benchmark/agents/ppo/baseline-lane-control.yaml'. The common feature is that the default number of episodes is very long; for example, with 'python ultra/rllib_train.py --task 1 --level easy --policy ppo', memory fills up at around episode 500K.

Hi @Yuanzhuo-Liu, sorry for the late reply regarding the ULTRA problem. Can I ask what you mean by "longer episode"? Do you mean a larger number for the agent's max_episode_steps? If you want to shorten the episode length, you can specify the --max-episode-steps argument when running ultra/rllib_train.py.

Also, just for clarification, what do you mean by "the episode fills up at 500k of memory"?

Finally, just to confirm, you are using a machine with 32 GB of memory to run these experiments?
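As a minimal illustration of the cap mentioned above, an episode-step limit simply truncates the environment loop, bounding how many transitions (and so how much memory) a single episode can accumulate. The run_episode helper and toy environment below are hypothetical, not SMARTS/ULTRA code.

```python
# Illustrative sketch of an episode-step cap in the spirit of ULTRA's
# --max-episode-steps flag; names here are hypothetical, not the real API.
def run_episode(step_fn, max_episode_steps):
    """Step the environment until it reports done or the cap is hit."""
    steps = 0
    done = False
    while not done and steps < max_episode_steps:
        done = step_fn(steps)
        steps += 1
    return steps


# A toy environment that would otherwise run for 1000 steps is
# truncated at 200, bounding the data collected per episode.
steps_taken = run_episode(lambda t: t >= 999, max_episode_steps=200)
print(steps_taken)  # 200
```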

Meta-YZ commented 3 years ago
> > • Your example episode count is shorter. The same error occurred when I ran 'python ultra/rllib_train.py --task 1 --level easy --policy ppo' and '/home/meta/SMARTS/baselines/marl_benchmark/agents/ppo/baseline-lane-control.yaml'. The common feature is that the default number of episodes is very long; for example, with 'python ultra/rllib_train.py --task 1 --level easy --policy ppo', memory fills up at around episode 500K.
>
> Hi @Yuanzhuo-Liu, sorry for the late reply regarding the ULTRA problem. Can I ask what you mean by "longer episode"? Do you mean a larger number for the agent's max_episode_steps? If you want to shorten the episode length, you can specify the --max-episode-steps argument when running ultra/rllib_train.py.
>
> Also, just for clarification, what do you mean by "the episode fills up at 500k of memory"?
>
> Finally, just to confirm, you are using a machine with 32 GB of memory to run these experiments?

Hi @christianjans , Thank you very much for your reply.

  1. "longer episode" means "--episodes". image

  2. This type of error occurs(ray.memory_monitor.RayOutOfMemoryError) when the program runs to around 550K, and this program cannot continue to run. image

  3. Yes, I'm running this experiment on a 32GB RAM machine. image
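One way to confirm that memory grows monotonically across episodes, rather than plateauing, is to sample allocations with Python's standard-library tracemalloc module. This is a hedged diagnostic sketch: the leaky history list below is a stand-in for whatever state actually accumulates during training, not ULTRA code.

```python
# Diagnostic sketch: use the stdlib tracemalloc module to watch
# allocated memory grow across episodes. The "history" list simulates
# a leak; it is not ULTRA/SMARTS code.
import tracemalloc

tracemalloc.start()
history = []          # simulated leak: grows every episode
samples = []

for episode in range(5):
    history.append([0.0] * 10_000)       # leak ~80 KB per episode
    current, _peak = tracemalloc.get_traced_memory()
    samples.append(current)

tracemalloc.stop()

# Strictly increasing samples across episodes are the signature of a
# leak; a healthy training loop would plateau instead.
print(all(b > a for a, b in zip(samples, samples[1:])))  # True
```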

Meta-YZ commented 3 years ago

Hi @RutvikGupta @christianjans, I am very sorry for missing some information: I updated some of the configuration in rllib_train.py.

I wonder if that is why you could not reproduce my error? Thanks!

sah-huawei commented 3 years ago

xref issues #870, #557, and #855

(just linking the issues for future reference)