mikeyEcology / MLWIC

Machine Learning for Wildlife Image Classification

Stop and Restart training #31

Open pirocha opened 5 years ago

pirocha commented 5 years ago

I have been training a model with MLWIC for 2 weeks and it is only at epoch 12. The computer where I'm running the training sessions needs to be used and restarted, so my question is whether there is a way to stop the training session and restart it afterwards to finish the 55 epochs. I noticed it creates a snapshot after each epoch; is it possible to compile this information and test the new model with the classify command?

Thank you for your help.

mikeyEcology commented 5 years ago

Yes, it should be possible. Do you have a file in L1/USDA182 called "snapshot-XX.data-XXXX-XXXX" that was created on your computer (you should be able to tell based on the date the file was modified)? If so, you can specify retrain=TRUE and the model will begin training from the most recent snapshot of the trained model. If you stop that future run after (55-11=) 44 epochs, it will have given you 55 epochs in total. Currently you cannot specify the number of epochs; I can add this functionality soon so you could set num_epochs=44, but I haven't added it yet.
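For readers hitting the same situation, a resumed call might look roughly like the sketch below. This is only an illustration: retrain and the snapshot behaviour are what's described above, while the other argument names, paths, and values are assumptions that should be replaced with exactly what was used in the original train() call (see ?MLWIC::train).

```r
library(MLWIC)

# Resume training from the most recent snapshot instead of starting over.
# All paths and values below are placeholders; reuse the ones from the
# interrupted run so the existing snapshot files are found.
train(
  path_prefix   = "/path/to/images",        # same image folder as the first run
  data_info     = "image_labels.csv",       # same label file as the first run
  model_dir     = "/path/to/MLWIC_helper",  # same helper-file location
  python_loc    = "/usr/bin/",              # location of python on this machine
  num_classes   = 5,                        # number of classes in your dataset
  log_dir_train = "USDA182",                # directory holding the snapshot-XX files
  retrain       = TRUE                      # pick up from the latest snapshot
)
```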

pirocha commented 5 years ago

Thank you. Yes, it does create those files, one per epoch. But doesn't the program overwrite these files when I restart the training? In any case, I can make a copy of this folder.

My other question is: how do I compile the information from the epochs if I don't let the training finish in a single session?

Thank you!

mikeyEcology commented 5 years ago

You won't need to compile the epochs; this happens automatically because the training will pick up (retrain) from where the previous training left off. I just updated the package, so if you update it on your machine you should be able to specify num_epochs = 44 in the train command.
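In practice that update-and-resume step might look something like this (a sketch only; it assumes the devtools package is installed and that the rest of the train() arguments are kept identical to the original run):

```r
# Reinstall the updated package from GitHub to pick up the new num_epochs argument.
devtools::install_github("mikeyEcology/MLWIC")
library(MLWIC)

# Then rerun train() with the same arguments as the original run, adding:
#   retrain    = TRUE   # continue from the most recent snapshot
#   num_epochs = 44     # 55 planned epochs minus the ~11 already completed
```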

pirocha commented 5 years ago

Thank you. I tried stopping and restarting the train command with retrain = TRUE, but I got an error message:

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key scale2/block1/a/beta not found in checkpoint [[{{node save/RestoreV2}} = RestoreV2[dtypes=[DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Can you help me with this? Thank you.
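(A note for anyone debugging this: the error says the graph being restored expects a variable, scale2/block1/a/beta, that is not stored in the snapshot. One way to confirm which variables a snapshot actually contains is to read the checkpoint directly. The sketch below is only illustrative: it assumes the tensorflow R package is bound to the same TensorFlow 1.x installation MLWIC uses, and the snapshot path and number are placeholders.)

```r
library(tensorflow)

# Open the checkpoint written during the first training run. Point this at the
# snapshot prefix, i.e. the file name without the ".data-XXXX-of-XXXX" suffix.
reader <- tf$train$NewCheckpointReader("L1/USDA182/snapshot-11")

# List the variable names stored in the checkpoint and check for the key the
# error complains about.
stored_vars <- names(reader$get_variable_to_shape_map())
"scale2/block1/a/beta" %in% stored_vars
```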

mikeyEcology commented 4 years ago

Hi @pirocha, sorry for the delay in getting back to you on this. I have created a new package, MLWIC2. In this package you will be able to specify the log_dir that you used for the first part of training when you start retraining your model. Unfortunately, you'll have to start from scratch because the new package will not retrain from the old one. I'm still working out bugs in the new package, so if you run into problems, please post them to the MLWIC2 issue board. Let me know if you have questions.
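For completeness, getting started with the new package might look like the sketch below. Only the log_dir behaviour is described above; everything else, including the exact argument names of MLWIC2's train(), should be checked against the MLWIC2 documentation.

```r
# Install the new package from GitHub (assumes devtools is available).
devtools::install_github("mikeyEcology/MLWIC2")
library(MLWIC2)

# Train from scratch with MLWIC2's train(). If the run is interrupted, restart
# it with the same log_dir you used the first time so training continues from
# that directory's snapshots (see ?train in MLWIC2 for the exact arguments).
```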