maikherbig / AIDeveloper

GUI-based software for training, evaluating and applying deep neural nets for image classification
BSD 2-Clause "Simplified" License

Power cut and corrupted meta Excel file #57

Closed jonathancolledge closed 6 months ago

jonathancolledge commented 6 months ago

Hello again,

After 4 days of training, I had a power cut. I wanted to load and continue, but got the error shown in the attached image (I'm loading the latest model, but it happens with any of the models that were saved during training).

*(screenshot: AID error)*

I then thought I'd try to open the meta xlsx file. Sadly, that is corrupted; I assume the power cut is the likely cause. I tried repairing it with Excel, but that doesn't work. I tried LibreOffice, but whatever character set I choose, it is all garbled.

*(screenshot: AID meta xlsx garbled)*

For the future, if I lengthen the time between meta.xlsx saves, might that lessen the chance of the file being written when the power goes off? Also, when loading and continuing, should I load the latest model or the best one to continue? e.g. 2024_02_22_EfficientNetB0_Mura_34427.model, saved at epoch 34427.

If I load and restart with the .arch file, isn't that the same as just clicking new and choosing the original network? (If I load a model there, it gives the same error as load and continue.) Is it all dependent on the meta.xlsx, or is there something else going on?

I'm wondering if I have done enough training - loss was still decreasing as of epoch 34427, I was at least up to 37000 with no new model saved, and the highest accuracy on the validation data was at epoch 30428. The trouble is, even if I want to take model 30428 and load and continue with new data for transfer learning, it gives the same error.

I'm guessing I need to restart afresh?

maikherbig commented 6 months ago

1) a) Corrupted Excel file: That's very unfortunate. AIDeveloper tries to avoid such situations by saving progress in between. However, if the power cut happens right during saving, it's probably going to corrupt the file. You are right: when training times are so long, you can increase the time between saves of the Excel file. When you are operating in ranges of 10k and more epochs, saving can even take a few seconds. Hence, it's a good idea to do that less often.
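As a general mitigation for this class of problem (a sketch of the usual write-then-rename pattern, not how AIDeveloper currently saves its meta file), writing to a temporary file and atomically replacing the target means a power cut mid-write can at worst lose the temp file, never corrupt the existing one:

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    """Write data to a temp file in the same directory, then atomically
    replace the target. A power cut during the write can then only lose
    the temp file, never corrupt the existing file at `path`.
    (Hypothetical helper sketch, not part of AIDeveloper.)"""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the bytes to disk before renaming
        os.replace(tmp, path)      # atomic on both POSIX and Windows
    except Exception:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

The same idea applies to any serialized bytes, e.g. an Excel file written to an in-memory buffer first.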

1) b) Which model to continue training?: Using the last epoch would be ideal, as this would be most similar to a single continuous training run. However, that model may not have been saved (AIDeveloper only saves a model if it reaches a new record in validation accuracy or validation loss). Hence, I would go for the last model that was saved.
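Since the saved model filenames end in the epoch number (e.g. `2024_02_22_EfficientNetB0_Mura_34427.model` from the question above), the last saved one can be picked programmatically; a small sketch, assuming that naming pattern holds:

```python
import re

def latest_saved_model(filenames):
    """Return the model file saved at the highest epoch, assuming
    filenames end in '_<epoch>.model' as in AIDeveloper's output
    (e.g. '2024_02_22_EfficientNetB0_Mura_34427.model')."""
    def epoch(name):
        m = re.search(r"_(\d+)\.model$", name)
        return int(m.group(1)) if m else -1  # ignore non-matching files
    return max(filenames, key=epoch)
```

For example, `latest_saved_model(os.listdir(model_dir))` would pick the checkpoint from the highest epoch in a directory of saved models.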

2) What's the .arch file?: Yes, the .arch file is only the architecture, and you are right: it would be the same as choosing that model in the dropdown and starting training fresh. The intended purpose of the .arch file is more to send network architectures to others. E.g. if you modify the EfficientNet in modelzoo.py and it turns out great, you may want to share just the architecture with colleagues, who can then try it on their own data.

3) Is the training enough?: Loss and accuracy alone are actually not so interesting. Please compare accuracy and validation accuracy. You want to choose a model with maximum validation accuracy. Loss will always be decreasing during training, but at some point it's just overfitting to the training data. Hence, the model will not be good on new data (the model is "overfit"). Often, when you start training, both accuracy and validation accuracy increase steeply. In many cases, you already get a good model at the end of that phase (just a couple of epochs). Next, it often happens that accuracy increases faster than validation accuracy, and you have to find a point where the model performs well on new data (validation) while not yet being too optimized on the training data. I tried to explain that principle here: https://youtu.be/dvFiSRnwoto?t=384
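The principle above can be illustrated with two per-epoch accuracy lists (an illustration of the idea, not AIDeveloper's internal logic): pick the epoch with the best validation accuracy, and watch the training/validation gap, which grows as the model overfits:

```python
def pick_model(acc, val_acc):
    """acc and val_acc are per-epoch training and validation accuracy
    lists (as one could read from an intact meta file). Return the
    epoch index with the highest validation accuracy, plus the
    train/validation gap there; a growing gap after that epoch is
    the signature of overfitting."""
    best = max(range(len(val_acc)), key=lambda i: val_acc[i])
    gap = acc[best] - val_acc[best]
    return best, gap
```

For instance, with a validation accuracy that peaks and then declines while training accuracy keeps climbing, `pick_model` points at the peak, before the overfitting sets in.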

4) Why are no models saved anymore?: If I understood correctly, you are concerned that no models are saved although you see that the loss is still decreasing. As in 3), please also check the validation loss. AIDeveloper only looks at validation loss and validation accuracy and saves a model every time a new record is reached.

5) Start fresh?: No. I guess this was more than enough training with the chosen parameters. Check out the curves for validation loss and validation accuracy and choose a model. Please note that in your case, where epoch numbers are on the order of 10k, it can be random luck that a particular model reached a new record. It can be better to use a model from an earlier epoch.

6) What next?: Train again and play around with the image augmentation parameters to modify your training data in a way that makes your model more robust (-> better validation accuracy).

jonathancolledge commented 6 months ago

Thank you very much.

1. Yes, my best-performing model in terms of validation accuracy came around 7000 epochs before the power cut.

I only know because I saved the txt output beforehand.

I don't have the Excel meta file anymore now that it is corrupted.

If I want to use this model for transfer learning, given that I get the above error when I "load and continue", can I just ignore that error and use it with my new data?

maikherbig commented 6 months ago

I understand that due to the missing meta.xlsx you get an error when doing 'load and continue'. You can just remove the file and use 'load and continue' without a meta file. If you leave the corrupted meta file in place, AIDeveloper will probably crash.
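Rather than deleting the corrupted file outright, renaming it out of the way achieves the same thing while keeping it around in case any rows turn out to be salvageable; a small sketch (hypothetical helper, not part of AIDeveloper):

```python
import os

def sidestep_meta(path):
    """Rename a corrupted meta file (e.g. 'meta.xlsx') to a '.corrupt'
    backup so 'load and continue' no longer finds it, while keeping the
    bytes on disk in case parts of it can still be recovered.
    Returns the backup path, or None if the file does not exist."""
    if os.path.exists(path):
        backup = path + ".corrupt"
        os.replace(path, backup)  # overwrites an older backup if present
        return backup
    return None
```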

jonathancolledge commented 6 months ago

That's great news, I'll delete the corrupted file. Thank you so much for all the help.