microsoft / AutonomousDrivingCookbook

Scenarios, tutorials and demos for Autonomous Driving
MIT License

DistributedRL training - Loss value is so high and not coming down #87

Open kalum84 opened 5 years ago

kalum84 commented 5 years ago

Problem description

The loss values are very high and are not coming down over time.

Problem details

We are trying to create a racing environment and use reinforcement learning to train a model to race in it, so we started from this example. We wanted to test how much time it takes to train a model and how far it can get. I used the same parameters as in the example, except for the following one:

   max_epoch_runtime_sec = 30

I also didn't change the code. I have attached the output file from one agent. Please help me troubleshoot what the issue is.
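For reference, this is roughly the only deviation from the example configuration; a minimal sketch, assuming the agent parameters are collected in a simple dictionary (the dictionary itself is an illustration, not the tutorial's actual launch code):

```python
# Hypothetical parameter override: only max_epoch_runtime_sec differs from the example.
agent_parameters = {
    "max_epoch_runtime_sec": 30,  # changed from the example's default value
    # every other parameter kept at the value used in the example
}
```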

Experiment/Environment details

I used the existing weights to start with and began training on Azure with 6 NV6 machines: 5 agents and the trainer. While the job was running I restarted the agents after some time (after 12 h), then ran the training for another 20 h. Output log from one agent: agent1.txt
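To check whether the loss is genuinely flat or just noisy, it can help to plot the loss values from the agent output. A minimal sketch, assuming the loss is printed one value per line in a form like `Loss: 1234.5`; the file name and the regex are assumptions and will need to match the actual format of agent1.txt:

```python
import re
import matplotlib.pyplot as plt

# Hypothetical log format: adjust the pattern to whatever agent1.txt actually prints.
loss_pattern = re.compile(r"[Ll]oss[:=]\s*([0-9]+(?:\.[0-9]+)?)")

losses = []
with open("agent1.txt") as log_file:
    for line in log_file:
        match = loss_pattern.search(line)
        if match:
            losses.append(float(match.group(1)))

plt.plot(losses)
plt.xlabel("Logged training iteration")
plt.ylabel("Loss")
plt.title("Loss trend from agent1.txt")
plt.show()
```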

mitchellspryn commented 5 years ago

We discussed a bit offline, but this paper might be of interest to you.

The algorithm as written does not scale indefinitely. Try 3 or 4 machines.

Also, the model will overfit - there is no concept of early stopping. Try checking back on it after an hour or an hour and a half.
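Since the training loop has no early stopping, one workaround is to snapshot weights periodically and keep the checkpoint with the best evaluation reward rather than the last one. A minimal sketch, assuming a Keras-style `model.save_weights()` and a hypothetical `evaluate_reward` callable; neither hook exists in the tutorial code as written:

```python
import os

class BestCheckpointSaver:
    """Keep only the weights with the best evaluation reward seen so far.

    `model` is assumed to expose a Keras-style save_weights(), and
    `evaluate_reward` is a hypothetical callable that runs a few evaluation
    episodes and returns their average reward.
    """

    def __init__(self, checkpoint_dir):
        self.checkpoint_dir = checkpoint_dir
        self.best_reward = float("-inf")
        os.makedirs(checkpoint_dir, exist_ok=True)

    def maybe_save(self, model, evaluate_reward):
        # Evaluate the current weights and persist them only if they improve on the best so far.
        reward = evaluate_reward(model)
        if reward > self.best_reward:
            self.best_reward = reward
            model.save_weights(os.path.join(self.checkpoint_dir, "best_weights.h5"))
        return reward
```

Calling `maybe_save` from the trainer every so often (e.g. every 30-60 minutes) gives a usable checkpoint even if the loss later diverges.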