matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Loss is always NaN #355

Open dberma15 opened 6 years ago

dberma15 commented 6 years ago

Hi,

I'm trying to train on my dataset for the Data Science Bowl 2018 competition and I'm having trouble: the loss is always NaN no matter what I try. I know my dataset is structured properly, as it looks like this:

{'image': {'checksum': '2ce2a6891c485df5f5e1175b48cb305f', 'pathname': 'stage1_train/7f38885521586fc6011bef1314a9fb2aa1e4935bd581b2991e1d963395eab770/images/7f38885521586fc6011bef1314a9fb2aa1e4935bd581b2991e1d963395eab770.png', 'shape': {'r': 1024, 'c': 1024, 'channels': 3}}, 'objects': [{'bounding_box': {'minimum': {'r': 12, 'c': 14}, 'maximum': {'r': 14, 'c': 17}}, 'category': 'cell'}, {'bounding_box': {'minimum': {'r': 169, 'c': 109}, 'maximum': {'r': 170, 'c': 110}},  'category': 'cell'},....

So the problem isn't the data. Once I load it and try to run it, the model compiles, but the loss is consistently NaN and I can't figure out why. Can someone help?


import pickle

import keras
import numpy as np

import keras_rcnn.datasets.shape
import keras_rcnn.models
import keras_rcnn.preprocessing


def main():
    training_data = pickle.load(open('xtrdata.pkl', 'rb'))

    # Random 90/10 train/validation split of the annotation dictionaries.
    msk = np.random.random(len(training_data)) < 0.9
    training_dictionary = [trn for trn, m in zip(training_data, msk) if m]
    test_dictionary = [trn for trn, m in zip(training_data, msk) if not m]
    # training_dictionary, test_dictionary = keras_rcnn.datasets.shape.load_data()

    categories = {"cell": 1}

    # Training generator.
    generator = keras_rcnn.preprocessing.ObjectDetectionGenerator()
    generator = generator.flow_from_dictionary(
        dictionary=training_dictionary,
        categories=categories,
        target_size=(256, 256)
    )

    # Validation generator.
    validation_data = keras_rcnn.preprocessing.ObjectDetectionGenerator()
    validation_data = validation_data.flow_from_dictionary(
        dictionary=test_dictionary,
        categories=categories,
        target_size=(256, 256)
    )

    keras.backend.set_learning_phase(1)

    model = keras_rcnn.models.RCNN(
        categories=["cell"],
        dense_units=512,
        input_shape=(256, 256, 3)
    )

    optimizer = keras.optimizers.Adam()
    model.compile(optimizer)
    model.save("test_rcnn.h5")

    model.fit_generator(
        epochs=100,
        steps_per_epoch=4,
        generator=generator,
        validation_data=validation_data
    )


if __name__ == '__main__':
    main()
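Before digging into the model, it can be worth pulling a single batch from the generator and checking it for NaN/inf values, since bad inputs or targets will make every loss NaN on the first step. A minimal sketch, assuming the generator yields nested tuples/lists of numpy arrays like a standard Keras generator (check_batch is an illustrative helper, not part of keras_rcnn):

import numpy as np

def check_batch(batch, path="batch"):
    """Recursively report any NaN/inf values in a (possibly nested) batch."""
    if isinstance(batch, (list, tuple)):
        for i, item in enumerate(batch):
            check_batch(item, "{}[{}]".format(path, i))
    elif isinstance(batch, np.ndarray) and np.issubdtype(batch.dtype, np.floating):
        if not np.all(np.isfinite(batch)):
            print("{}: contains NaN/inf, shape={}".format(path, batch.shape))

# check_batch(next(generator))  # `generator` as built in the script above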
Paulito-7 commented 6 years ago

Hi, I have the same issue here but can't figure it out. My first training run worked, but I started a new one today and hit this problem. I am wondering whether it could come from the masks' size or from overlapping masks (those are the only things I changed compared to my previous working run). Is that your case as well?

When running inference, the RPN losses seem to be the problem: I get very thin boxes spanning the full height of the image, predicted as a certain class with probability 1. Even when I stop training before the loss becomes NaN, I still see that behavior.

EDIT: @dberma15 I found my mistake: in my config I set NUM_CLASS to 4 whereas I actually have 7 classes. Now it works properly with a learning rate of 1e-3! Can you confirm you don't have the same mistake?
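For anyone hitting the same mismatch with this repo's code: the class count lives on the Config subclass, and NUM_CLASSES must count the background class as well. A minimal sketch under that assumption (CellConfig and the class name are placeholders for your own dataset):

from mrcnn.config import Config

class CellConfig(Config):
    """Illustrative config for one foreground class ("cell")."""
    NAME = "cell"
    NUM_CLASSES = 1 + 1      # background + 1 foreground class; must match the dataset
    LEARNING_RATE = 1e-3     # lower this if the losses still blow up

As @Paulito-7 found, a class count smaller than the real number of classes can surface as NaN losses rather than an explicit error.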

ranjeetthakur commented 5 years ago

Try changing active_class_ids = np.zeros([dataset.num_classes], dtype=np.int32) to active_class_ids = np.ones([dataset.num_classes], dtype=np.int32) in model.py.
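For context, that line sits in load_image_gt() in mrcnn/model.py, where the code marks which classes belong to the image's source dataset. A small self-contained sketch of the mechanism (toy values, paraphrased rather than a verbatim copy of the library code):

import numpy as np

# Roughly what load_image_gt() does: mark only the classes of this
# image's source dataset as "active".
num_classes = 5                          # toy value
source_class_ids = np.array([0, 2, 3])   # toy: classes in this image's source
active_class_ids = np.zeros([num_classes], dtype=np.int32)
active_class_ids[source_class_ids] = 1
print(active_class_ids)                  # [1 0 1 1 0]

The suggested change replaces the zeros/scatter with np.ones, marking every class active for every image. As far as I can tell this only changes how mrcnn_class_loss is masked, so it may hide a class-bookkeeping mistake rather than address NaNs in the RPN losses.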

DerOzean commented 5 years ago

I changed the LR to ~1e-4 and it solved the problem. I still don't know what's an optimal LR, but I think something like ~5e-5 should be a good starting point.
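In this repo the learning rate is normally set on the Config subclass and passed to model.train(). A minimal sketch of lowering it, assuming the standard matterport training workflow (NucleiConfig, the "logs" directory, and the dataset objects are placeholders):

import mrcnn.model as modellib
from mrcnn.config import Config

class NucleiConfig(Config):
    """Illustrative config: one foreground class, lowered learning rate."""
    NAME = "nuclei"
    NUM_CLASSES = 1 + 1        # background + one foreground class
    LEARNING_RATE = 1e-4       # lowered from the 1e-3 default

config = NucleiConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")

# dataset_train / dataset_val would be prepared mrcnn Dataset objects:
# model.train(dataset_train, dataset_val,
#             learning_rate=config.LEARNING_RATE,
#             epochs=30, layers="heads")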

AzimAhmadzadeh commented 5 years ago

I also had this problem and solved it. As others have mentioned in different issues in this repo, the problem is the learning rate. In my case the original setting in the config file was BASE_LR: 0.02 | STEPS: (60000, 80000) | MAX_ITER: 90000, which caused a NaN loss after the 3rd iteration! I then changed it to BASE_LR: 0.0025 | STEPS: (480000, 640000) | MAX_ITER: 720000, which comes from dividing the first by 8 and multiplying the other two by 8, as suggested in the readme here. The default settings are for 8 GPUs and I have only 2, so some changes were expected.

However, those changes raised the estimated training time (ETA) from 4 days to 41 days! To avoid such a long run, I only changed BASE_LR from 0.02 to 0.01. To tell whether that is enough, I will have to look at the loss plot and where it plateaus.
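For reference, the adjustment above is the Detectron-style linear scaling rule rather than anything specific to this repo: divide the base learning rate and multiply the schedule lengths by the same factor so the run still covers the same number of images. The numbers quoted correspond to a factor of 8:

# Linear scaling rule, reproducing the numbers quoted above (factor of 8)
factor = 8
base_lr, steps, max_iter = 0.02, (60000, 80000), 90000

scaled_lr = base_lr / factor                     # 0.0025
scaled_steps = tuple(s * factor for s in steps)  # (480000, 640000)
scaled_max_iter = max_iter * factor              # 720000
print(scaled_lr, scaled_steps, scaled_max_iter)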

mshankr commented 5 years ago

I changed the LR to ~1e-4 and it solved the problem. I still don't know what's an optimal LR, but I think something like ~5e-5 should be a good starting point.

@DerOzean I agree. I set the LR to ~2e-4 and my loss is no longer NaN. Maybe when the LR is too large, the gradients explode?

Behnam72 commented 2 years ago

I have the same problem and my losses for the first two epochs are:

Epoch 1/2
1172/1172 [==============================] - 949s 810ms/step - loss: nan - rpn_class_loss: nan - rpn_bbox_loss: nan - mrcnn_class_loss: 0.7993 - mrcnn_bbox_loss: 0.9282 - mrcnn_mask_loss: 5.8983e-04 - val_loss: nan - val_rpn_class_loss: nan - val_rpn_bbox_loss: nan - val_mrcnn_class_loss: 0.6931 - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 2/2
1172/1172 [==============================] - 137s 117ms/step - loss: nan - rpn_class_loss: nan - rpn_bbox_loss: nan - mrcnn_class_loss: 0.6931 - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: nan - val_rpn_bbox_loss: nan - val_mrcnn_class_loss: 0.6931 - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00

I have also trained this model on my CPU and I don't get NaN there.

Things I have tried:

Does anyone know what might be the issue?
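One pattern worth ruling out when rpn_class_loss and rpn_bbox_loss go NaN while the mrcnn heads stay finite is degenerate ground-truth boxes (zero width or height) coming from tiny or clipped masks. A hedged sketch that scans a matterport-style Dataset for them before training (find_degenerate_boxes is an illustrative helper; it assumes your Dataset subclass implements load_mask as usual, and it will not explain a CPU-works/GPU-NaN split like the one reported above):

import numpy as np
from mrcnn import utils

def find_degenerate_boxes(dataset):
    """Return image_ids whose ground-truth boxes have zero height or width."""
    bad = []
    for image_id in dataset.image_ids:
        masks, class_ids = dataset.load_mask(image_id)
        if masks.shape[-1] == 0:
            continue
        boxes = utils.extract_bboxes(masks)   # [N, (y1, x1, y2, x2)]
        heights = boxes[:, 2] - boxes[:, 0]
        widths = boxes[:, 3] - boxes[:, 1]
        if np.any(heights <= 0) or np.any(widths <= 0):
            bad.append(image_id)
    return bad

# bad_ids = find_degenerate_boxes(dataset_train)  # dataset_train: a prepared Dataset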

tusharvora commented 1 year ago

@Behnam72 Hey, I am facing the same issue.

hayawar commented 1 year ago

@tusharvora I'm facing the exact same issue as you; were you able to solve it?

tusharvora commented 1 year ago

@tusharvora I'm facing the exact same issue as you; were you able to solve it?

@haya-alwarthan: Try inspecting your model weights with "mrcnn/samples/coco/inspect_weights.ipynb", both when you load them with mrcnn's model.load_weights and when you load them directly with Keras, as mentioned here.

Can you confirm that the model weights are initialized correctly and not as shown in the image below (i.e., dead weights)? I am still figuring out the exact reason. [Screenshot: weights loaded with mrcnn model.load_weights]
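A quick way to compare the two loading paths without opening the notebook is to print per-layer weight statistics after each load. A minimal sketch, assuming the standard matterport MaskRCNN wrapper (summarize_weights and the file names are placeholders):

def summarize_weights(keras_model, max_layers=15):
    """Print mean/std of each layer's first weight tensor; std near 0 everywhere hints at dead weights."""
    for layer in keras_model.layers[:max_layers]:
        weights = layer.get_weights()
        if weights:
            w = weights[0]
            print("{:30s} mean={:+.4f} std={:.4f}".format(layer.name, w.mean(), w.std()))

# With the mrcnn loading path, for example:
# model = mrcnn.model.MaskRCNN(mode="training", config=config, model_dir="logs")
# model.load_weights("mask_rcnn_coco.h5", by_name=True)
# summarize_weights(model.keras_model)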

hayawar commented 1 year ago

@tusharvora

The weights seem to load successfully without Keras loading. However, I suspect this has something to do with CUDA not being compatible with the 30-series GPUs. My machine has an RTX 3060 with CUDA 10 (the minimum supported is 11).
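That suspicion is straightforward to check: a TensorFlow build compiled against CUDA 10 has no kernels for the RTX 30-series (compute capability 8.6), and in practice that often shows up as NaN losses rather than an explicit error. A minimal sketch for confirming what your build actually sees, using the TF 1.x API (the exact contents of the device description vary between versions):

import tensorflow as tf
from tensorflow.python.client import device_lib

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
for device in device_lib.list_local_devices():
    if device.device_type == "GPU":
        # physical_device_desc includes the card name and its compute capability
        print(device.physical_device_desc)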

RHf995 commented 1 year ago

@tusharvora

The weights seem to load successfully without Keras loading. However, I suspect this has something to do with CUDA not being compatible with the 30-series GPUs. My machine has an RTX 3060 with CUDA 10 (the minimum supported is 11).

Did you solve this issue? I have an RTX 3080 and am using CUDA 10.0 for Mask R-CNN, and I am getting all losses as NaN.

AhmadAlmuhtadi commented 1 year ago

@tusharvora

The weights seem to load successfully without Keras loading. However, I suspect this has something to do with CUDA not being compatible with the 30-series GPUs. My machine has an RTX 3060 with CUDA 10 (the minimum supported is 11).

Did you solve this issue? I have an RTX 3080 and am using CUDA 10.0 for Mask R-CNN, and I am getting all losses as NaN.

Same here with an RTX 3070: all losses are NaN. Did you figure anything out?

maximenicol commented 1 year ago

I know this is an old thread, but I had the same issue and managed to get the model working on modern GPUs (RTX 3080 and RTX 4000) using conda, CUDA 11.8, and NVIDIA's own maintained build of TensorFlow 1.15. Here is my full conda env if you want to run the model. Happy training!

Attachment: tf15-gt3x.txt