microsoft / AutonomousDrivingCookbook

Scenarios, tutorials and demos for Autonomous Driving
MIT License
2.3k stars 563 forks

MemoryError when training DistributedRL after about an hour #104

Open JazzTao opened 5 years ago

JazzTao commented 5 years ago

Problem description

When I train DistributedRL using https://github.com/Microsoft/AutonomousDrivingCookbook/blob/master/DistributedRL/LaunchLocalTrainingJob.ipynb it works at first. After about one hour, I get the error below. (PS: To get "train.bat" to run at all the first time, I changed "threshold=np.nan" to "threshold=sys.maxsize" on line 609 of "distributed_agent.py". I don't know whether that matters.)

My English is not very good, so I am not sure whether I have explained this clearly.

Problem details

Start time: 2019-04-15 07:23:33.036246, end time: 2019-04-15 07:23:45.755073
Percent random actions: 0.10204081632653061
Num total actions: 98
Generating 98 minibatches...
Sampling Experiences.
Publishing AirSim Epoch.
Publishing epoch data and getting latest model from parameter server...
Traceback (most recent call last):
  File "distributed_agent.py", line 643, in <module>
    agent.start()
  File "distributed_agent.py", line 80, in start
    self.run_function()
  File "distributed_agent.py", line 175, in run_function
    self.__publish_batch_and_update_model(sampled_experiences, frame_count)
  File "distributed_agent.py", line 401, in __publish_batch_and_update_model
    gradients = self.model.get_gradient_update_from_batches(batches)
  File "E:\File\Train_Airsim\AD_Cookbook_AirSim\python36_DRL\Share\scripts_downpour\app\rl_model.py", line 135, in get_gradient_update_from_batches
    post_states = np.array(batches['post_states'])
MemoryError
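
The traceback shows the allocation failing while np.array(batches['post_states']) builds one dense array from all the minibatches. As a rough aid for anyone hitting the same error, the sketch below estimates how many bytes such an array would need. Only the count of 98 minibatches comes from the log above. The batch size, the sample shape, and the helper name estimate_array_bytes are hypothetical values chosen for illustration, not taken from the cookbook code. This does not diagnose the cause. It only shows how quickly the size grows with the element type.

```python
from functools import reduce
from operator import mul


def estimate_array_bytes(num_batches, batch_size, sample_shape, itemsize):
    """Bytes needed for one dense array holding every sample of every batch."""
    elements_per_sample = reduce(mul, sample_shape, 1)
    return num_batches * batch_size * elements_per_sample * itemsize


# 98 minibatches is the figure printed in the log; the rest is hypothetical.
NUM_BATCHES = 98
BATCH_SIZE = 32               # hypothetical batch size
SAMPLE_SHAPE = (59, 255, 3)   # hypothetical image shape (height, width, channels)

as_float64 = estimate_array_bytes(NUM_BATCHES, BATCH_SIZE, SAMPLE_SHAPE, 8)
as_uint8 = estimate_array_bytes(NUM_BATCHES, BATCH_SIZE, SAMPLE_SHAPE, 1)

print("float64: %.1f MiB" % (as_float64 / 2 ** 20))
print("uint8:   %.1f MiB" % (as_uint8 / 2 ** 20))
```

Plugging in the real shapes (for example by printing the shape of one element of batches['post_states'] before line 135 of rl_model.py) would show whether the request is simply larger than the free memory on the machine, or whether memory is being used up elsewhere over the hour of training.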

Experiment/Environment details

Zhenlin-Xu commented 2 years ago

What was your solution? I ran into the same issue on the newest version. Instead of changing "threshold=np.nan" to "threshold=sys.maxsize", I changed it to "threshold=np.inf" so that the script runs without that error.

Thank you!
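
On the question of whether the threshold change matters: assuming line 609 of distributed_agent.py is a call to np.set_printoptions (the thread does not show the line itself, so this is a guess), the threshold argument only controls when NumPy shortens the printed form of a large array. The minimal sketch below, which uses a made-up array named data, checks that the setting changes the text output and leaves the array's storage untouched. If that assumption holds, the change would not be expected to cause the MemoryError by itself.

```python
import sys

import numpy as np

# Assumption: line 609 of distributed_agent.py is a call like this one.
# Recent NumPy releases reject threshold=np.nan, hence sys.maxsize here.
np.set_printoptions(threshold=sys.maxsize)

data = np.arange(2000)  # made-up array, large enough to be summarized by default
text = repr(data)

# With an effectively unlimited threshold the repr is not shortened with "...".
full_repr = "..." not in text

# The print option affects formatting only; the array's storage is unchanged.
size_in_bytes = data.nbytes

print(full_repr, size_in_bytes)
```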