Different results after saving and loading trained weights

anklebreaker commented 4 years ago

Hey, thanks so much for creating this project! It was easy to use, and I was able to customize it to my needs.

I've been trying to train tiny YOLOv4 on a custom single class. Currently, I'm getting pretty good results in terms of validation loss, and the model converges well. However, if the training session terminates and I either resume training or predict from a checkpoint (saved either manually with save_weights() or by the ModelCheckpoint callback), I get wildly different results. The loss jumps from 2 to over 200, as if the weights were untrained, and model predictions return mostly zeros.

At first, I suspected that my preprocessing was causing this, but after double checking and searching online, it seems that there might be an issue with Keras and/or Tensorflow saving the network. Model.save() unfortunately doesn't work at all.

This post has a long thread of others who seem to have a similar issue, and the solution appears to be architecture-specific. From what I gathered, possible causes include upsampling layers or Lambda layers in loops. Another post describes the fix as setting a random seed.

For reference, I'm using yolo.model.load_weights() and yolo.model.save_weights().
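One way to narrow this down might be to fingerprint the weights right before saving and again right after loading in the new session. If the two digests match, the save/load round trip itself is intact and the difference comes from somewhere else. The following is only a minimal sketch: `weights_fingerprint` is a hypothetical helper, not part of the library, and it assumes the same architecture is built on both sides. It relies only on each weight array having a `tobytes()` method, which the NumPy arrays returned by a Keras model's `get_weights()` provide.

```python
import hashlib


def weights_fingerprint(weights):
    """Return one SHA-256 hex digest over a sequence of weight arrays.

    Hypothetical helper. Each item only needs a tobytes() method,
    which NumPy arrays (as returned by get_weights()) provide.
    """
    digest = hashlib.sha256()
    for w in weights:
        digest.update(w.tobytes())
    return digest.hexdigest()


# Intended use (not run here; `yolo` is the object from this thread):
#   print(weights_fingerprint(yolo.model.get_weights()))  # before save_weights()
#   ... new session: make_model(), then yolo.model.load_weights(path) ...
#   print(weights_fingerprint(yolo.model.get_weights()))  # after load_weights()
# Compare the two printed digests by eye or store them alongside the checkpoint.
```

If the digests differ, some weights were not restored; if they match, preprocessing or training setup is the more likely place to look.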

hhk7734 commented 4 years ago

I haven't tested tiny training yet, haha. Anyway, can you share your test code?

anklebreaker commented 4 years ago

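import os

import numpy as np
import tensorflow
from tensorflow.keras import optimizers
from yolov4.tf import YOLOv4


class SaveCallback(tensorflow.keras.callbacks.Callback):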
    def __init__(self):
        super().__init__()
        self.best_loss = np.inf
        self.filedir = "models/"
        self.model = yolo.model
        self.trial = trial

    def on_epoch_end(self, epoch, logs=None):
        keys = list(logs.keys())
        if 'val_loss' in keys:
            if logs['val_loss'] < self.best_loss:
                savepath = self.filedir + 'trial' + str(self.trial) + '-epoch' + str(epoch + 1)
                print('\n' + 'Val_loss improved from ' + str(self.best_loss) + ' to ' + str(logs['val_loss']) + '. Saving model to ' + savepath + '/' + 'ckpt...')
                self.best_loss = logs['val_loss']
                if not os.path.exists(savepath):
                    os.makedirs(savepath)
                yolo.model.save_weights(savepath + '/' + str(self.trial) + '-epoch' + str(epoch + 1) + 'ckpt')
            else:
                print('\n' + 'Val_loss did not improve from ' + str(self.best_loss))

yolo = YOLOv4(tiny=True)
yolo.input_size = 352
yolo.batch_size = 128
yolo.subdivision = 2
yolo.channel_input = 4
yolo.anchors = np.round(np.array([16.30463356, 39.65267033, 23.3268787, 57.3856978, 33.07815072, 79.4291659, 45.28507059, 109.08966657, 65.79054054, 146.20500288, 99.24201597, 208.32784431])).astype(np.int32)
yolo.classes = {0: "relevant_person"}
eval = False
trial = 2
epochs = 1500

yolo.make_model()
yolo.model.load_weights('models/trial2-epoch200/2-epoch200ckpt')
train_data = yolo.load_dataset('traintext.txt')
val_data = yolo.load_dataset('valtext.txt', training=False)
lr = 1.
optimizer = optimizers.Adadelta(learning_rate=lr)
yolo.compile(optimizer=optimizer, loss_iou_type="ciou")

if not eval:
    csvfile = "models/log-" + str(trial) + ".csv"
    csvlog = tensorflow.keras.callbacks.CSVLogger(csvfile, separator=',', append=True)

    yolo.model.fit(
                train_data,
                epochs=epochs,
                verbose=1,
                callbacks=[SaveCallback(), csvlog],
                batch_size=yolo.batch_size // yolo.subdivision,
                steps_per_epoch=yolo.subdivision,
                validation_data=val_data,
                validation_steps=1000//(yolo.batch_size//yolo.subdivision),
                validation_freq=50,
                initial_epoch=200
            )

I made a custom callback to save the model just to test whether it was something wrong with ModelCheckpoint callback.
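Because the checkpoint path is assembled by hand in the callback and typed out again in the load_weights() call, a small helper could keep the two in sync. This is only a sketch: `checkpoint_prefix` is a hypothetical name, and it simply mirrors the string concatenation in SaveCallback, where the epoch label is `epoch + 1`.

```python
def checkpoint_prefix(filedir, trial, epoch_label):
    """Build the weights path prefix the same way SaveCallback does.

    Hypothetical helper. Layout:
    <filedir>trial<trial>-epoch<label>/<trial>-epoch<label>ckpt
    """
    run_dir = f"{filedir}trial{trial}-epoch{epoch_label}"
    return f"{run_dir}/{trial}-epoch{epoch_label}ckpt"


# The path loaded in the script above, rebuilt from its parts.
print(checkpoint_prefix("models/", 2, 200))
# models/trial2-epoch200/2-epoch200ckpt
```

Calling the same function from both the callback and the loading code would rule out a path mismatch as the cause.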

hhk7734 commented 4 years ago

Please share the inference code too. I'm testing on the TPU right now, so I'll test your code after that.

anklebreaker commented 4 years ago

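import cv2
import matplotlib.pyplot as plt

# Stack the 3-channel image with a grayscale image (flag 0) into a 4-channel input.
image = cv2.imread("ImageSections/2967_340_1583547791_FishEye_24732_0.jpg")
court = cv2.imread("CourtSections/2967_340_1583547791_FishEye_24732_0.jpg", 0)
pred_im = np.dstack((image, court))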
bboxes = yolo.predict(pred_im)
print(bboxes)
yolo.draw_bboxes(pred_im, bboxes)
plt.imshow(pred_im)
plt.show()

I modified the code for a 4-channel input. Inference runs when eval above is True.

anklebreaker commented 4 years ago

Unrelated, but I saw in the Dataset class, under the next() method, that the counter was updated and reset outside the for loop. I'm not sure whether that's intentional or a bug, but I checked and the same image was being sent batch_size times per batch. I changed it as below so that the counter update and reset happen inside the loop, which makes each batch contain different images.

        if self.batch_size > 1:
            batch_x = []
            #batch_y_s = []
            batch_y_l = []
            batch_y_m = []
            for _ in range(self.batch_size):
                x, y = self.preprocess_dataset(self.dataset[self.count])
                batch_x.append(x)
                #batch_y_s.append(y[0])
                batch_y_m.append(y[0])
                batch_y_l.append(y[1])
                self.count += 1
                if self.count == len(self.dataset):
                    np.random.shuffle(self.dataset)
                    self.count = 0
            batch_x = np.concatenate(batch_x, axis=0)
            #batch_y_s = np.concatenate(batch_y_s, axis=0)
            batch_y_m = np.concatenate(batch_y_m, axis=0)
            batch_y_l = np.concatenate(batch_y_l, axis=0)
            batch_y = (batch_y_m, batch_y_l)
        else:
            batch_x, batch_y = self.preprocess_dataset(self.dataset[self.count])
            self.count += 1
            if self.count == len(self.dataset):
                np.random.shuffle(self.dataset)
                self.count = 0
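The effect of where the counter is advanced can be shown with a stripped-down sketch. The two functions below are hypothetical stand-ins, not the library's code: the first follows the behavior described in the comment above (counter advanced once per batch, simplified to a wrap-around without reshuffling), and the second follows the modified loop.

```python
import random


def batch_counter_outside(dataset, batch_size, count=0):
    """Counter advanced once per batch: every slot gets the same sample."""
    batch = [dataset[count] for _ in range(batch_size)]
    count = (count + 1) % len(dataset)
    return batch, count


def batch_counter_inside(dataset, batch_size, count=0, rng=random):
    """Counter advanced once per sample: slots get consecutive samples."""
    batch = []
    for _ in range(batch_size):
        batch.append(dataset[count])
        count += 1
        if count == len(dataset):
            rng.shuffle(dataset)  # reshuffle in place at the end of a pass
            count = 0
    return batch, count


samples = ["img0", "img1", "img2", "img3", "img4"]
same, _ = batch_counter_outside(samples, 3)
distinct, _ = batch_counter_inside(samples, 3)
print(same)      # ['img0', 'img0', 'img0']
print(distinct)  # ['img0', 'img1', 'img2']
```

With the counter inside the loop, a batch can also span the end of one pass and the start of the next, which is why the reshuffle sits inside the loop as well.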

hhk7734 commented 4 years ago

~~Can you send me a PR with the following changes?~~

commit: 7fc91f630f1f0

hhk7734 commented 4 years ago

@anklebreaker

~~yolo.predict(frame) predicts only one image.~~

Oh, 4-channel.

hhk7734 commented 4 years ago

It seems to be a training problem, not a problem with saving and loading.

anklebreaker commented 4 years ago

Thanks for the update. What makes you say the problem is with the training? Training appeared fine; I only had issues when resuming training or testing a prediction in a different session.

hhk7734 commented 4 years ago

I just trained with custom data, and the result came out well... Maybe it was a setup error.

hhk7734 commented 4 years ago

I couldn't find a clear answer, so I implemented a yolo.save_weights() function. After training, save the weights using yolo.save_weights("custom.weights", weights_type="yolo"). Then, when you want to load them, use yolo.load_weights("custom.weights", weights_type="yolo").

It will be released in v0.19.0. Ref: 35f1d22618, 79daece4538e1

anklebreaker commented 4 years ago

Yeah, it might be some system or version issue. I'll try out the new functions. Thanks for looking into it!

hhk7734 / tensorflow-yolov4

Different results after saving and loading trained weights #14