microsoft / AutonomousDrivingCookbook

Scenarios, tutorials and demos for Autonomous Driving
MIT License

Space required for local training of DistributedRL? #86

Closed wonjoonSeol closed 5 years ago

wonjoonSeol commented 5 years ago

Problem description

Training the model requires a large amount of disk space (110 GB+)

Problem details

I just had my first run at training the model. After a day it crashed partway through because of a low-disk-space warning on my SSD. When I checked, I realised the training run had occupied 110 GB and left no remaining space on my SSD.

How much space do I need to train the template model provided in the notebook, if I run it for 5 days as the guide suggests?

Experiment/Environment details

NextSim commented 5 years ago

I think it's on the order of a terabyte or two.

mitchellspryn commented 5 years ago

If you're running on a single node, you can drastically reduce the amount of data being written by reducing the checkpointing frequency. To do this, look in Share/scripts_downpour/app/distributed_agent.py at __publish_batch_and_update_model. There you will see the logic for checkpointing - you can decide to only save every Nth model, or implement logic to clean up older models.
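The two options mentioned above (saving only every Nth model, and cleaning up older models) could be sketched roughly as follows. This is not the code from distributed_agent.py: the helper names should_checkpoint and prune_old_checkpoints, the keep_last parameter, and the "*.json" file pattern are all hypothetical and would need to be adapted to however the agent actually names and writes its checkpoint files.

```python
import glob
import os


def should_checkpoint(batch_index, every_n):
    """Return True only on every Nth batch, so most batches skip the save.

    Hypothetical helper: `batch_index` is assumed to be a running count of
    published batches maintained by the caller.
    """
    return every_n > 0 and batch_index % every_n == 0


def prune_old_checkpoints(checkpoint_dir, keep_last=5, pattern="*.json"):
    """Delete all but the `keep_last` most recently modified checkpoint files.

    `pattern` is an assumption; adjust it to match the real checkpoint names.
    Returns the list of paths that were removed.
    """
    paths = glob.glob(os.path.join(checkpoint_dir, pattern))
    # Newest first, so everything past the first `keep_last` entries is stale.
    paths.sort(key=os.path.getmtime, reverse=True)
    stale = paths[max(keep_last, 0):]
    for path in stale:
        os.remove(path)
    return stale
```

One possible use would be to guard the existing save call with should_checkpoint(...) and call prune_old_checkpoints(...) right after each save. Either change on its own bounds disk usage; combining them also reduces the amount of writing.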