Closed wonjoonSeol closed 5 years ago
I think it's on the order of a terabyte or two
If you're running on a single node, you can drastically reduce the amount of data being written by reducing the checkpointing frequency. To do this, look in Share/scripts_downpour/app/distributed_agent.py at __publish_batch_and_update_model. There you will see the checkpointing logic; you can choose to save only every Nth model, or add logic to clean up older models.
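As a rough illustration of the two suggestions above (save only every Nth model, and prune older checkpoints), here is a minimal standalone sketch. The function names, the `every_nth`/`keep_last` parameters, and the `*.model` filename pattern are all hypothetical, not part of the cookbook's actual code; the real logic lives in __publish_batch_and_update_model in distributed_agent.py.

```python
import glob
import os


def should_checkpoint(batch_count, every_nth=10):
    """Hypothetical helper: persist the model only every Nth batch
    instead of after every batch update."""
    return batch_count % every_nth == 0


def prune_old_checkpoints(checkpoint_dir, keep_last=5):
    """Hypothetical helper: delete all but the most recent checkpoints.

    Assumes checkpoints match '*.model' (an illustrative pattern);
    files are ordered by modification time, oldest first.
    """
    files = sorted(
        glob.glob(os.path.join(checkpoint_dir, "*.model")),
        key=os.path.getmtime,
    )
    # Keep only the newest `keep_last` files; remove the rest.
    for path in files[:-keep_last] if keep_last > 0 else files:
        os.remove(path)
```

With `every_nth=10` the disk footprint drops to roughly a tenth of saving every batch, and pruning caps it at `keep_last` models regardless of run length.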
Problem description
Large disk space required for training the model (110 GB+)
Problem details
I just had my first run to train the model; after a day it crashed partway through due to a low-disk-space warning on my SSD. After checking, I realised the training run had occupied 110 GB, leaving no remaining space on the SSD.
How much space do I need for the template training run provided by the notebook, if run for 5 days as suggested by the guide?
Experiment/Environment details