microsoft / AutonomousDrivingCookbook

Scenarios, tutorials and demos for Autonomous Driving
MIT License

DistributeRL agent performance nowhere close to gif #93

Closed wonjoonSeol closed 5 years ago

wonjoonSeol commented 5 years ago

Problem description

Improving reward function for DistributedRL

Problem details

RL agent

The gif in the readme: is it the result of running the base tutorial code you provided, without any modifications? My local training results (> 5 days on a GTX 980) from just running the tutorial code are nowhere close to this performance.

I'm just wondering why the model in the gif performs so well without an improved reward function?

Experiment/Environment details

As per https://github.com/Microsoft/AutonomousDrivingCookbook/issues/85, I have updated the code for the latest AirSim binary and am currently re-training. I will see if this makes any difference.

mitchellspryn commented 5 years ago

I'm not sure why your model is not performing well. The trained model in the gif did come from the provided code, although it was trained on a cluster using the distributed method.

wonjoonSeol commented 5 years ago

Running RunModel with sample_model.json, without loading any weights, shows performance comparable to the gif with some resets; it still crashes every now and then.

But when I try to train on top of that model by loading sample_model.json and training further on my local machine, performance actually gets a lot worse. I am not loading any weights, just the model; I have commented out the part that loads pretrained weights for the conv layers, since the checkpoint outputs a single JSON only.
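For reference, this is roughly how I'm loading the architecture only (a minimal sketch, assuming sample_model.json holds a Keras architecture definition; the compile settings here are placeholders, not the tutorial's):

from keras.models import model_from_json

# Load only the network architecture; the weights keep their random
# initialization because load_weights() is never called.
with open('sample_model.json', 'r') as f:
    model = model_from_json(f.read())

# model.load_weights('pretrained_weights.h5')  # commented out on purpose
model.compile(optimizer='adam', loss='mean_squared_error')  # placeholder settings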

Furthermore, from time to time I get this error message and training halts:

Getting Pose
Waiting for momentum to die
Resetting
Running car for a few seconds...
Model predicts 0
Traceback (most recent call last):
  File "distributed_agent.py", line 649, in <module>
    agent.start()
  File "distributed_agent.py", line 84, in start
    self.__run_function()
  File "distributed_agent.py", line 164, in __run_function
    experiences, frame_count = self.__run_airsim_epoch(False)
  File "distributed_agent.py", line 323, in __run_airsim_epoch
    state_buffer = self.__append_to_ring_buffer(self.__get_image(), state_buffer, state_buffer_len)
  File "distributed_agent.py", line 465, in __get_image
    image_rgba = image1d.reshape(image_response.height, image_response.width, 4)
ValueError: cannot reshape array of size 1 into shape (0,0,4)

Any idea what's happening here? Why does the image array sometimes have a different size?

mitchellspryn commented 5 years ago

Yes, the sample_model.json isn't perfect, and will sometimes crash.

Further training won't work; you'll end up overfitting. I noticed while training that if we let the model run for too long, it would start to perform worse. Unfortunately, I don't have a better way of detecting the overfitting than stopping once the model starts to perform decently.
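If you want something more mechanical than eyeballing it, one option is a rolling-average check on episode reward (a hypothetical sketch; the window size and threshold are illustrative and not part of the tutorial code):

from collections import deque

REWARD_WINDOW = 20        # episodes to average over (illustrative)
STOP_THRESHOLD = 150.0    # tune to whatever "decent" means for your track

recent_rewards = deque(maxlen=REWARD_WINDOW)

def should_stop_training(episode_reward):
    # Return True once the rolling mean reward clears the threshold.
    recent_rewards.append(episode_reward)
    if len(recent_rewards) < REWARD_WINDOW:
        return False
    return sum(recent_rewards) / len(recent_rewards) >= STOP_THRESHOLD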

Regarding the error you are getting: it looks like the exe is occasionally not returning any data. I've tried to repro this locally, but can't get it to happen. A simple fix would be to bail out if we receive an image of size zero, as in the sketch below.
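Something along these lines inside __get_image should do it (a sketch only; the field names follow the AirSim ImageResponse in the traceback, while self.__car_client, self.__image_request, and the None return convention are illustrative):

import numpy as np

def __get_image(self):
    image_response = self.__car_client.simGetImages([self.__image_request])[0]
    image1d = np.frombuffer(image_response.image_data_uint8, dtype=np.uint8)

    # Bail out when the exe returns no data instead of crashing on reshape.
    if image_response.height == 0 or image_response.width == 0 or image1d.size < 4:
        return None  # caller should skip this frame or request another image

    image_rgba = image1d.reshape(image_response.height, image_response.width, 4)
    return image_rgba[:, :, 0:3]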

wonjoonSeol commented 5 years ago

That's very interesting. Thank you.