pythonlessons / mltu

Machine Learning Training Utilities (for TensorFlow and PyTorch)

Saving and Loading model errors #25

Open Pixel535 opened 11 months ago

Pixel535 commented 11 months ago

Hi, I am training my model on my dataset following the tutorial. The training sometimes takes quite a long time, so I wanted to load the model saved by the checkpoint callback using this code:

        if os.path.exists("Model/model.h5"):
            HTR_Model = load_model("Model/model.h5")
            new_model = False
        else:
            img_shape = (self.height, self.width, 3)
            HTR_Model = self.HTR_Model(img_shape, characters_num, vocab)
            HTR_Model.compile_model()
            HTR_Model.summary(line_length=110)
            new_model = True

And then continue training with this code:

        earlystopper = EarlyStopping(monitor='val_CER', patience=20, verbose=1, mode='min')
        checkpoint = ModelCheckpoint("Model/model.h5", monitor='val_CER', verbose=1, save_best_only=True, mode='min')
        trainLogger = TrainLogger("Model")
        tb_callback = TensorBoard('Model/logs', update_freq=1)
        reduceLROnPlat = ReduceLROnPlateau(monitor='val_CER', factor=0.9, min_delta=1e-10, patience=10, verbose=1,
                                           mode='auto')
        model2onnx = Model2onnx("Model/model.h5")

        if new_model is True:
            HTR_Model.train(training_data,
                            val_data,
                            epochs=1000,
                            workers=20,
                            callbacks=[earlystopper, checkpoint, trainLogger, reduceLROnPlat, tb_callback, model2onnx])
        else:
            HTR_Model.fit(training_data,
                          validation_data=val_data,
                          epochs=1000,
                          workers=20,
                          callbacks=[earlystopper, checkpoint, trainLogger, reduceLROnPlat, tb_callback, model2onnx],
                          )

Unfortunately I encountered the following error: ValueError: Unknown loss function: CTCloss. Please ensure this object is passed to the custom_objects argument.

So I tried passing that argument like this:

HTR_Model = load_model("Model/model.h5", custom_objects={'CTCloss': CTCloss})

But it didn't work and I got this error: TypeError: CTCloss.__init__() got an unexpected keyword argument 'reduction'
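
(A note on this error: the .h5 file stores the loss's serialized config, including a reduction key, and load_model passes that config back to CTCloss.__init__(), which doesn't accept it. A minimal workaround sketch, assuming mltu's CTCloss only takes a name argument; LoadableCTCloss is a hypothetical name and the import path varies between mltu versions:)

    from keras.models import load_model
    from mltu.losses import CTCloss  # may be mltu.tensorflow.losses in newer versions

    class LoadableCTCloss(CTCloss):
        # Accept and discard serialized kwargs such as 'reduction'
        def __init__(self, name="CTCloss", **kwargs):
            super().__init__()

    HTR_Model = load_model("Model/model.h5", custom_objects={"CTCloss": LoadableCTCloss})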

I couldn't solve it, so I started looking for other ways to load the model. This time I tried saving the model in .tf format and loading it without the custom_objects argument, which caused this error: Unable to restore custom object of type _tf_keras_metric. Please make sure that any custom layers are included in the custom_objects arg when calling load_model() and make sure that all layers implement get_config and from_config.

After that I added the argument like this:

HTR_Model = load_model("Model/model.tf", custom_objects={'CERMetric': CERMetric(vocabulary=vocab), 'WERMetric': WERMetric(vocabulary=vocab)})

And the error was TypeError: CERMetric.__init__() missing 1 required positional argument: 'vocabulary', even though I passed that argument.
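
(Likely cause, as a side note: Keras rebuilds the metric by calling the class with its saved config, and vocabulary was never serialized by get_config(), so passing an already-constructed instance doesn't help. A hedged sketch of subclasses that bake the vocabulary in; the class names are hypothetical and the exact revival behavior depends on the Keras version:)

    # Hypothetical subclasses: inject the vocabulary so that
    # CERMetric(**saved_config) no longer needs it in the config
    class CERMetricWithVocab(CERMetric):
        def __init__(self, name="CER", **kwargs):
            super().__init__(vocabulary=vocab, name=name, **kwargs)

    class WERMetricWithVocab(WERMetric):
        def __init__(self, name="WER", **kwargs):
            super().__init__(vocabulary=vocab, name=name, **kwargs)

    HTR_Model = load_model("Model/model.tf",
                           custom_objects={"CERMetric": CERMetricWithVocab,
                                           "WERMetric": WERMetricWithVocab})

The only thing that works is this code: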

HTR_Model = load_model("Model/model.h5", compile=False)
HTR_Model.compile(loss=CTCloss(), metrics=[CERMetric(vocabulary=vocab), WERMetric(vocabulary=vocab)], run_eagerly=False)

But it doesn't seem to load all the weights. I also tried using BackupAndRestore to pick up where I left off, but I still couldn't confirm that it restores those weights and keeps using them. So is it possible to load a saved model after training is interrupted and continue training it, staying in line with the tutorial? (For example, I am at epoch 53/1000 and I can see that the best value so far was saved to model.h5 at epoch 52, so I stop training; then I want to load the model saved at epoch 52 and continue from there.)
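
(For reference, a minimal resume sketch built on the compile=False approach above, reusing the callbacks and data variables from the earlier snippets; initial_epoch is a standard Keras fit argument that makes the run continue counting from epoch 53/1000:)

    # Restore architecture + weights without deserializing loss/metrics
    HTR_Model = load_model("Model/model.h5", compile=False)
    HTR_Model.compile(loss=CTCloss(),
                      metrics=[CERMetric(vocabulary=vocab), WERMetric(vocabulary=vocab)])

    # Resume training; the next epoch is reported as 53/1000
    HTR_Model.fit(training_data,
                  validation_data=val_data,
                  epochs=1000,
                  initial_epoch=52,
                  workers=20,
                  callbacks=[earlystopper, checkpoint, trainLogger,
                             reduceLROnPlat, tb_callback, model2onnx])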

pythonlessons commented 11 months ago

Have you tried creating the model and then using .load_weights("path_to.h5")?
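
(A sketch of that suggestion, reusing the constructor from the first snippet: rebuild the architecture exactly as during training, then restore only the weights, so no loss or metric has to be deserialized. It assumes the HTR_Model wrapper forwards load_weights to the underlying Keras model:)

    img_shape = (self.height, self.width, 3)
    HTR_Model = self.HTR_Model(img_shape, characters_num, vocab)
    HTR_Model.compile_model()
    # Restore only the weights from the checkpoint file
    HTR_Model.load_weights("Model/model.h5")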

Pixel535 commented 11 months ago

Yes, but it still seems the same. I printed the weight values using get_weights() before and after each load, and they do change. But during training the loss, CER, and WER values are much more erratic than at the end of the previous run. Also, the checkpoint is supposed to save only the best results after each epoch, but while the previous training had reached e.g. 0.63, after loading the model/weights and running one epoch it saves a value of 0.68, as if it does not see that the previous best was 0.63.
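
(That last part is expected: ModelCheckpoint keeps its running best only in memory, so every new run starts from scratch and the first completed epoch always looks "best". On recent TF/Keras versions the callback can be seeded with the previous best value; a sketch using the 0.63 from the comment above:)

    # Seed the checkpoint with the best val_CER from the interrupted run,
    # so a worse epoch (e.g. 0.68) does not overwrite model.h5
    checkpoint = ModelCheckpoint("Model/model.h5", monitor='val_CER', verbose=1,
                                 save_best_only=True, mode='min',
                                 initial_value_threshold=0.63)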

Pixel535 commented 10 months ago

After some time, I was able to fully train the model, but unfortunately I couldn't load it and test it according to the guide. After training, I commented out the training code and ran the validation code (to check that everything later in the tutorial works, I also tested the model and code from the tutorial itself). Unfortunately, every test had CER and WER values equal to 1 or more (training had ended via EarlyStopping with CER 0.0734 and WER 0.19), as if the model was not being read correctly at all. As far as I can tell, all versions and all the code are identical to the tutorial, and yet it does not work properly. I've created a repository and I don't know what's wrong with the code: https://github.com/Pixel535/ScribbleScan---Engineering-thesis Would you have any suggestions on how to solve this problem?

pythonlessons commented 10 months ago

Are you sure that your vocab is consistent and the same in training and validation? I don't see where you save it.

Do you always load it from the dataset, without any sorting or similar to keep it consistent?
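
(A sketch of one way to keep it consistent: build the vocabulary deterministically and persist it once, then reuse the saved file everywhere; the file name and dataset shape are hypothetical:)

    # Build the vocabulary deterministically (deduplicated, sorted) and save it,
    # so training, validation, and inference all share the same ordering
    vocab = "".join(sorted(set(char for _, label in dataset for char in label)))
    with open("Model/vocab.txt", "w") as f:
        f.write(vocab)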

Pixel535 commented 10 months ago

Yes, this was causing the invalid validation values. Thank you so much for finding it. I must have forgotten to sort the vocab so that it was identical everywhere. Now, after changing the code, model.evaluate(validate) keeps returning one constant value close to the training result (CER was around 0.07), not around 0.95 as before:

Validation loss: 5.065186500549316
Validation CER: 0.030590945854783058
Validation WER: 0.10572760552167892

But unfortunately, when I run the validation code that displays the predicted text and the photo, the values are very different and the CER and WER values are mostly close to 1 (the average is almost 1), for example:

Label: 6
Prediction: N
CER: 1.0; WER: 1.0

Label: The house was , as he
Prediction: i
CER: 1.0; WER: 1.0

Label: The Foreign Office
Prediction: The Toreign Office
CER: 0.1111111111111111; WER: 0.6666666666666666

Label: who used the system - and used it with power and authority .
Prediction: 3
CER: 1.0; WER: 1.0

Label: up with no apparent tiredness at all when
Prediction:
CER: 1.0; WER: 1.0

What could be the reason for this? Could it be due to the additional datasets outside of IAM making the predictions inaccurate?

pythonlessons commented 10 months ago

I can't tell what is wrong, but the model is not good enough. Maybe you'll need to play around with the architecture to improve it, or add more data to the training set. Are you trying to predict handwriting? Usually people combine several different datasets to train a model; I used one dataset for simplicity.

Pixel535 commented 10 months ago

Yes, I just want to predict what is written in handwritten notes, because I then want to add a function that detects spelling mistakes. That is why I used IAM, about 100 of my own photos, and a dataset of mathematical symbols. In that case, I will test different architectures on IAM alone for now and add more datasets later.