init_with=="last" not working

ChristophNeuner commented 6 years ago

Hello,

I hope you can help me. First I trained only the head layers. Now I want to use the last saved .h5 file and train all layers with a smaller learning rate. But when I set init_with = "last", the training does not work.

Is there something else I have to change?

Thanks in advance!

Christoph

import os

import sys
import random
import math
import time

from bowl_config import bowl_config
from bowl_dataset import BowlDataset
import utils
import model as modellib
from model import log
from glob import glob

import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)

# Root directory of the project
ROOT_DIR = os.getcwd()

# Directory to save logs and trained model
MODEL_DIR = os.path.join(ROOT_DIR, "logs")

# Local path to trained weights file
COCO_MODEL_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5")
# Download COCO trained weights from Releases if needed
if not os.path.exists(COCO_MODEL_PATH):
    utils.download_trained_weights(COCO_MODEL_PATH)

model = modellib.MaskRCNN(mode="training", config=bowl_config,
                          model_dir=MODEL_DIR)

# Which weights to start with?
init_with = "last"  # imagenet, coco, or last

if init_with == "imagenet":
    model.load_weights(model.get_imagenet_weights(), by_name=True)
elif init_with == "coco":
    # Load weights trained on MS COCO, but skip layers that
    # are different due to the different number of classes
    # See README for instructions to download the COCO weights
    model.load_weights(COCO_MODEL_PATH, by_name=True,
                       exclude=["mrcnn_class_logits", "mrcnn_bbox_fc", 
                                "mrcnn_bbox", "mrcnn_mask"])
elif init_with == "last":
    # Load the last model you trained and continue training
    model.load_weights(model.find_last()[1], by_name=True)
    #model_path = "./logs/bowl20180323T1602/mask_rcnn_bowl_0100.h5"
    #model.load_weights(model_path, by_name=True)

# Training dataset
dataset_train = BowlDataset()
dataset_train.load_bowl('stage1_train')
dataset_train.prepare()

# Validation dataset
dataset_val = BowlDataset()
dataset_val.load_bowl('stage1_train')
dataset_val.prepare()

# Train the head branches
# Passing layers="heads" freezes all layers except the head
# layers. You can also pass a regular expression to select
# which layers to train by name pattern.
#model.train(dataset_train, dataset_val, 
            #learning_rate=bowl_config.LEARNING_RATE, 
            #epochs=25, 
            #layers='heads')

model.train(dataset_train, dataset_val, 
            learning_rate=bowl_config.LEARNING_RATE / 10,
            epochs=100, 
            layers="all")

I always get one of these two errors:

[Two screenshots: "error" and "error2"]

fastlater commented 6 years ago

Maybe an invalid path? Check the file name, folder name, etc. It should work, since model.load_weights is from Keras. You can use print statements to check step by step what is going on.

ChiaoSun commented 6 years ago

I also have this issue, and the paths are all correct. I checked that the result of model.find_last()[1] is correct too.
Because of this issue, I end up training a whole new model every time.

fastlater commented 6 years ago

@ChiaoSun did you check def find_last(self): in model.py? Remember to name the folder as config.NAME + date + T + time, and to name the .h5 file as mask_rcnn_bowl_000X.h5.
If you still have problems, try printing step by step what happens inside that function until you find exactly where the error is. I have never had problems loading the last trained model and continuing to train it.
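
A minimal sketch of that naming check, assuming the layout is `<name><YYYYMMDD>T<HHMM>/mask_rcnn_<name>_<NNNN>.h5`. The helper `parse_checkpoint_path` and its regular expression are hypothetical, written only for illustration and not copied from model.py. It accepts both `/` and `\` as separators.

```python
import re

# Hypothetical pattern for illustration only; not taken from model.py.
# The folder name and the name inside the file name must agree.
CHECKPOINT_RE = re.compile(
    r"(?:^|[\\/])"                                   # start of path or a separator
    r"(?P<name>\w+?)(?P<date>\d{8})T(?P<time>\d{4})"  # e.g. bowl20180323T1602
    r"[\\/]"                                         # folder/file separator
    r"mask_rcnn_(?P=name)_(?P<epoch>\d{4})\.h5$"     # e.g. mask_rcnn_bowl_0100.h5
)


def parse_checkpoint_path(path):
    """Return the parts of a checkpoint path, or None if it does not match."""
    match = CHECKPOINT_RE.search(path)
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "date": match.group("date"),
        "time": match.group("time"),
        "epoch": int(match.group("epoch")),
    }


if __name__ == "__main__":
    # Paths quoted in this thread.
    print(parse_checkpoint_path("./logs/bowl20180323T1602/mask_rcnn_bowl_0100.h5"))
    print(parse_checkpoint_path(r"checkpoint\nuclei20180330T0129\mask_rcnn_nuclei_0050.h5"))
```

A result of None means the folder or file name deviates from the convention, which is a quick thing to rule out before digging into find_last itself.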

ChiaoSun commented 6 years ago

@fastlater I checked def find_last(self) and it returns the correct model path (checkpoint\nuclei20180330T0129\mask_rcnn_nuclei_0050.h5). I found that model.train always seems to create a new checkpoint, even when I have already loaded weights from the last model. Am I using the wrong function (model.train) to continue training the model?

fastlater commented 6 years ago

That is correct. If you check the loss in TensorBoard, you will notice that it starts from the value reached at the last trained checkpoint. The script won't keep adding checkpoints to the previous folder, because you might want to start training something new from this point.

ChiaoSun commented 6 years ago

@fastlater, I see that the loss starts from the last trained model. Thanks for your help. I also found that if there is no .h5 file in the latest checkpoint folder, that may cause @ChristophNeuner's "NoneType is not callable" problem.
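
To illustrate that failure mode, here is a sketch of a lookup that skips run folders with no .h5 file and raises a clear error instead of handing None to load_weights. `find_last_checkpoint` is a hypothetical standalone helper, not the library's find_last. It assumes run folders start with the config name, checkpoint files start with mask_rcnn, and the timestamped names sort chronologically.

```python
import os


def find_last_checkpoint(model_dir, name):
    """Return the newest .h5 checkpoint under model_dir for runs named `name`.

    Hypothetical helper for illustration. Run folders that contain no
    checkpoint (for example, a run that stopped before the first epoch
    finished) are skipped rather than returned.
    """
    if not os.path.isdir(model_dir):
        raise FileNotFoundError("Model directory not found: " + model_dir)

    # Timestamped folder names sort chronologically as plain strings.
    run_dirs = sorted(
        d for d in os.listdir(model_dir)
        if d.startswith(name) and os.path.isdir(os.path.join(model_dir, d))
    )

    # Walk from the newest run backwards until one has a checkpoint.
    for run_dir in reversed(run_dirs):
        run_path = os.path.join(model_dir, run_dir)
        checkpoints = sorted(
            f for f in os.listdir(run_path)
            if f.startswith("mask_rcnn") and f.endswith(".h5")
        )
        if checkpoints:
            return os.path.join(run_path, checkpoints[-1])

    raise FileNotFoundError(
        "No .h5 checkpoint found in any '%s*' folder under %s" % (name, model_dir)
    )
```

The returned path could then be passed to model.load_weights(path, by_name=True) in place of model.find_last()[1].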

matterport / Mask_RCNN

init_with=="last" not working #362