Closed wonjoonSeol closed 5 years ago
I think it's on the order of a terabyte or two
If you're running on a single node, you can drastically reduce the amount of data being written by reducing the checkpointing frequency. To do this, look in Share/scripts_downpour/app/distributed_agent.py at __publish_batch_and_update_model. There you will see the checkpointing logic; you can choose to save only every Nth model, or add logic to clean up older models.
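As a rough illustration of the two suggestions above (save only every Nth model, and prune older checkpoints), here is a minimal standalone sketch. The function names, the `every_nth`/`keep_last` parameters, and the `*.model` filename pattern are all hypothetical, not part of the cookbook's actual code; the real logic lives in __publish_batch_and_update_model in distributed_agent.py.

```python
import glob
import os


def should_checkpoint(batch_count, every_nth=10):
    """Hypothetical helper: persist the model only every Nth batch
    instead of after every batch update."""
    return batch_count % every_nth == 0


def prune_old_checkpoints(checkpoint_dir, keep_last=5):
    """Hypothetical helper: delete all but the most recent checkpoints.

    Assumes checkpoints match '*.model' (an illustrative pattern);
    files are ordered by modification time, oldest first.
    """
    files = sorted(
        glob.glob(os.path.join(checkpoint_dir, "*.model")),
        key=os.path.getmtime,
    )
    # Keep only the newest `keep_last` files; remove the rest.
    for path in files[:-keep_last] if keep_last > 0 else files:
        os.remove(path)
```

With `every_nth=10` the disk footprint drops to roughly a tenth of saving every batch, and pruning caps it at `keep_last` models regardless of run length.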
Problem description
Large disk space required for training the model (110 GB+)
Problem details
I just had my first run to train the model; after a day it crashed partway through due to a low-disk-space warning on my SSD. After checking, I realised the training run had occupied 110 GB, leaving no remaining space on the SSD.
How much space do I need for the template training run provided by the notebook, if run for 5 days as suggested by the guide?
Experiment/Environment details