tryolabs / luminoth

Deep Learning toolkit for Computer Vision.
https://tryolabs.com
BSD 3-Clause "New" or "Revised" License

Resuming train while running on the computer #236

Closed by AshwinAce 5 years ago

AshwinAce commented 5 years ago

The command for training on a dataset is: lumi train -c path_to_config.yml

For resuming on the cloud, there are optional arguments that take the job ID. How do we resume when training runs on a local workstation? Does it automatically resume from the last saved checkpoint? Currently, I am terminating the program manually because the runtime is very long. I want to verify that resuming works even when training is forcibly terminated like this.

dekked commented 5 years ago

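Yes. Luminoth resumes from the last stored checkpoint. You can check this by watching the step number in the logs during training: after a restart it should continue from the checkpoint's step instead of starting again from the beginning.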

AshwinAce commented 5 years ago

Thanks! I stopped training too soon for any checkpoint to be saved. With the default settings, checkpoints are saved every 600 seconds, so a run terminated before that has nothing to resume from.
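
For anyone wanting to confirm that a checkpoint was actually written before killing a run, here is a minimal sketch. It assumes the run directory contains TensorFlow-style checkpoint files named model.ckpt-<step>.index; that naming is TensorFlow's usual convention and is not confirmed anywhere in this thread. The directory jobs/my-run and the helper latest_checkpoint_step are hypothetical names used only for illustration.

```python
import re
import sys
from pathlib import Path
from typing import Optional

# Assumption: checkpoints follow TensorFlow's usual "model.ckpt-<step>.index" naming.
CKPT_PATTERN = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_checkpoint_step(run_dir: str) -> Optional[int]:
    """Return the highest checkpoint step found in run_dir, or None if there is none."""
    path = Path(run_dir)
    if not path.is_dir():
        return None
    steps = []
    for entry in path.iterdir():
        match = CKPT_PATTERN.match(entry.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


if __name__ == "__main__":
    # "jobs/my-run" is a hypothetical path; pass your own run directory instead.
    run_dir = sys.argv[1] if len(sys.argv) > 1 else "jobs/my-run"
    step = latest_checkpoint_step(run_dir)
    if step is None:
        print(f"No checkpoint found in {run_dir}; training would start from scratch.")
    else:
        print(f"Latest checkpoint in {run_dir} is at step {step}.")
```

If this reports no checkpoint, the run was most likely stopped before the first save interval elapsed, which matches what happened above.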