neptune-ai / open-solution-mapping-challenge

Open solution to the Mapping Challenge :earth_americas:
https://www.crowdai.org/challenges/mapping-challenge
MIT License
378 stars, 96 forks

How to restore the training #160

Closed: robeson1010 closed this issue 6 years ago

robeson1010 commented 6 years ago

I trained on the data for 3 days, but unfortunately the process was interrupted. I ran 'python main.py -- train --pipeline_name unet_weighted' again, but it started training from epoch 0. How can I resume training from where I left off (54 epochs already completed)?

jakubczakon commented 6 years ago

@robeson1010 Hi, sorry for the late response. To restart training you need to override the set_model function, for example:

    def set_model(self):
        encoder = self.architecture_config['model_params']['encoder']
        if encoder == 'from_scratch':
            self.model = UNet(**self.architecture_config['model_params'])
        else:
            config = PRETRAINED_NETWORKS[encoder]
            self.model = config['model'](**config['model_config'])
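            # Replace the weight initializer with a no-op so it cannot overwrite the loaded weights.
            self._initialize_model_weights = lambda: None
            # Load the weights saved by the interrupted run.
            self.load('YOUR_FILEPATH_TO_MODEL')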

This applies if the model you trained uses one of those ResNet architectures. It is important to set self._initialize_model_weights to a no-op (lambda: None); otherwise it would simply overwrite your loaded weights with random values.
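A toy illustration of why the no-op matters. It assumes the initializer runs at the start of training. `ToyModel` and its methods are hypothetical stand-ins, not this repository's classes, and zeroed weights stand in for a fresh initialization.

```python
class ToyModel:
    """Hypothetical stand-in for a trainer that re-initializes weights in fit()."""

    def __init__(self):
        self.weights = None

    def _initialize_model_weights(self):
        # Default behaviour: start from fresh weights (zeros here for simplicity).
        self.weights = [0.0, 0.0]

    def load(self, saved_weights):
        # Pretend to restore weights from a file.
        self.weights = list(saved_weights)

    def fit(self):
        # Assumed call order: initialization first, then training (omitted).
        self._initialize_model_weights()
        return self.weights


saved = [0.7, -1.2]

naive = ToyModel()
naive.load(saved)
naive_result = naive.fit()  # initializer overwrites the loaded weights

patched = ToyModel()
patched._initialize_model_weights = lambda: None  # disable re-initialization
patched.load(saved)
patched_result = patched.fit()  # loaded weights survive

print(naive_result, patched_result)
```

Assigning the lambda on the instance shadows the class method, so only the patched object skips initialization.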

When you restart, the epoch counter will begin at 0 again (though your weights from epoch 54 will be used). I would suggest using a smaller learning rate if you were using some sort of decay. As of now we are not checkpointing the optimizer state so it will be difficult to restore the exact state of your training at epoch 54, but usually restarting with a new optimizer gets the job done.
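For reference, a minimal sketch of what checkpointing the optimizer state could look like in plain PyTorch. This is not code from this repository. The helper names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the tiny `nn.Linear` model are all assumptions made for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, epoch):
    """Store weights, optimizer state, and the last finished epoch in one file."""
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    """Restore both states in place and return the epoch to resume from."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint["epoch"] + 1


# Hypothetical stand-in model; one dummy step gives the optimizer some state.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
save_checkpoint(path, model, optimizer, epoch=54)

# Later, in a fresh process: rebuild the objects, then restore their state.
new_model = nn.Linear(4, 2)
new_optimizer = torch.optim.Adam(new_model.parameters(), lr=1e-3)
start_epoch = load_checkpoint(path, new_model, new_optimizer)
```

With a checkpoint like this, a training loop could continue with `range(start_epoch, total_epochs)` instead of starting from 0. The optimizer's learning rate and moment estimates are restored along with the weights.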

I hope this helps.

robeson1010 commented 6 years ago

@jakubczakon Thanks a lot!

carbonox-infernox commented 5 years ago

@jakubczakon

"As of now we are not checkpointing the optimizer state so it will be difficult to restore the exact state of your training"

Is this still the case? I was hoping to run the training 5-10 epochs at a time and keep checking on the model's progress. Then I'd like to add some new classes, but that's a different problem. Basically I don't want to pay for the full 100 and then find out that something went wrong, or otherwise pay for 100 when 50 might suffice.