Open | dberma15 opened this issue 6 years ago
Hi, I have the same issue here but can't figure it out. I got a first training run to work, but I started a new one today and hit this problem. I am wondering if it could come from the mask sizes or from overlapping masks, as those are the only things I changed compared to my previous working training. Is that your case as well?
When trying to run inference, the RPN seems to be broken: I get very thin boxes spanning the full height of the image, all predicted as a single class with probability 1. Even when I stop training before the loss reaches NaN, I still see that behavior.
EDIT: @dberma15 I found my mistake. The error came from the fact that in my config I set NUM_CLASS to 4 whereas I actually have 7 classes. Now it works properly with a learning rate of 1e-3! Can you confirm that you don't have the same mistake?
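For anyone hitting the same class-count mismatch, here is a minimal config sketch for the matterport `Config` class. `NUM_CLASSES` must count the background class plus every class your dataset registers; the class name and the count of 7 below are placeholders for illustration, not values from this thread.

```python
from mrcnn.config import Config

class MyConfig(Config):
    """Hypothetical training config; adjust the names and values to your dataset."""
    NAME = "my_dataset"
    # Background + the 7 object classes actually registered via dataset.add_class()
    NUM_CLASSES = 1 + 7
    LEARNING_RATE = 1e-3

config = MyConfig()
config.display()  # print the resolved settings and double-check NUM_CLASSES
```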
Try changing `active_class_ids = np.zeros([dataset.num_classes], dtype=np.int32)` to `active_class_ids = np.ones([dataset.num_classes], dtype=np.int32)` in model.py
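For context, if memory serves that line sits in `load_image_gt()` inside mrcnn/model.py, where the classes belonging to the image's source dataset are marked as active. Treat the surrounding lines below as an approximate sketch, not an exact copy of the file:

```python
# Approximate excerpt from load_image_gt() in mrcnn/model.py (from memory).
# By default only the classes of the image's source dataset are marked active:
active_class_ids = np.zeros([dataset.num_classes], dtype=np.int32)
source_class_ids = dataset.source_class_ids[dataset.image_info[image_id]["source"]]
active_class_ids[source_class_ids] = 1
# The suggested workaround replaces np.zeros with np.ones so every class is
# treated as active, which matters when class registration is inconsistent.
```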
I changed the LR to ~1e-4 and it solved the problem. I still don't know what the optimal LR is, but I think something like ~5e-5 should be a good starting point.
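If you want to try a lower learning rate without editing config.py, one option (a sketch, assuming the matterport API and that `model`, `dataset_train`, `dataset_val` already exist) is to pass it directly to `train()`:

```python
# Sketch: lower the learning rate for one training run without touching config.py.
# `model`, `dataset_train`, and `dataset_val` are assumed to be set up already.
model.train(dataset_train, dataset_val,
            learning_rate=1e-4,   # instead of config.LEARNING_RATE (0.001 by default)
            epochs=30,
            layers="heads")
```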
I also had this problem and I solved it.
As everyone mentioned in different issues raised in this repo, the problem is with the learning rate.
In my case the original setting in the config file is:
BASE_LR: 0.02 | STEPS: (60000, 80000) | MAX_ITER: 90000
which caused NaN loss after the 3rd iteration! Then I changed it to:
BASE_LR: 0.0025 | STEPS: (480000, 640000) | MAX_ITER: 720000
which comes from dividing the first by 8 and multiplying the other two by 8, as suggested in the README here.
The default setting is set for 8 GPUs. I have only 2. So, some changes were expected.
However, the above changes raised the estimated training time (i.e., eta) from 4 days to 41 days! So I avoided such a long training run by only changing BASE_LR from 0.02 to 0.01. To evaluate whether that is enough, I have to look at the loss plot and see where it plateaus.
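The adjustment above is the linear scaling rule the README refers to: scale the learning rate down, and the schedule up, by the same factor. A small sketch of the arithmetic, using the reference values quoted in the comment above (the factor of 8 matches that comment; with 2 GPUs the rule would use 8 / 2 = 4):

```python
# Linear scaling rule sketch: adapt the 8-GPU reference schedule to fewer GPUs.
ref_base_lr = 0.02
ref_steps = (60000, 80000)
ref_max_iter = 90000

factor = 8  # the comment above scales by 8; with 2 GPUs the rule would use 8 / 2 = 4
base_lr = ref_base_lr / factor                      # 0.0025
steps = tuple(s * factor for s in ref_steps)        # (480000, 640000)
max_iter = ref_max_iter * factor                    # 720000
print(base_lr, steps, max_iter)
```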
@DerOzean I agree. I set the LR to ~2e-4 and my loss is no longer NaN. Maybe when the LR is larger the gradients explode?
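If exploding gradients are the cause, besides lowering the LR you can also tighten gradient clipping, which the matterport config exposes as GRADIENT_CLIP_NORM. A hedged sketch (the values below are illustrations, not tuned recommendations):

```python
from mrcnn.config import Config

class StableConfig(Config):
    """Illustrative settings to tame exploding gradients; values are examples only."""
    NAME = "stable_run"
    NUM_CLASSES = 1 + 1          # background + your classes
    LEARNING_RATE = 2e-4         # lower than the 0.001 default
    GRADIENT_CLIP_NORM = 2.0     # default is 5.0; clip harder if losses blow up
```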
I have the same problem and my losses for the first two epochs are:
```
Epoch 1/2
1172/1172 [==============================] - 949s 810ms/step - loss: nan - rpn_class_loss: nan - rpn_bbox_loss: nan - mrcnn_class_loss: 0.7993 - mrcnn_bbox_loss: 0.9282 - mrcnn_mask_loss: 5.8983e-04 - val_loss: nan - val_rpn_class_loss: nan - val_rpn_bbox_loss: nan - val_mrcnn_class_loss: 0.6931 - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 2/2
1172/1172 [==============================] - 137s 117ms/step - loss: nan - rpn_class_loss: nan - rpn_bbox_loss: nan - mrcnn_class_loss: 0.6931 - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: nan - val_rpn_bbox_loss: nan - val_mrcnn_class_loss: 0.6931 - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
```
I have also trained this model on my CPU and I don't get NaN there.
Things I have tried:
Does anyone know what might be the issue?
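One check that has helped in similar NaN threads is validating the dataset itself before training: empty masks or degenerate boxes are a common culprit for NaN RPN losses. A hedged sketch, assuming a prepared matterport-style `Dataset` object named `dataset`:

```python
import numpy as np
from mrcnn import utils

# Sketch: scan a prepared mrcnn Dataset for empty masks or degenerate boxes,
# which can drive the RPN losses to NaN. `dataset` is assumed to be prepared.
for image_id in dataset.image_ids:
    masks, class_ids = dataset.load_mask(image_id)
    if masks.shape[-1] == 0:
        print("No instances for image", image_id)
        continue
    boxes = utils.extract_bboxes(masks)           # (N, 4) as (y1, x1, y2, x2)
    heights = boxes[:, 2] - boxes[:, 0]
    widths = boxes[:, 3] - boxes[:, 1]
    if np.any(heights <= 0) or np.any(widths <= 0):
        print("Degenerate box in image", image_id)
```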
@Behnam72 Hey, I am facing the same issue.
@tusharvora I'm facing the exact same issue as you; were you able to solve it?
@haya-alwarthan: Try inspecting your model weights with "mrcnn/samples/coco/inspect_weights.ipynb", both when you load them through mrcnn's "model.load_weights" and when you load them directly with Keras, as mentioned here.
Can you confirm that the model weights load correctly and don't look like the image below (i.e., dead weights)? I am still figuring out the exact cause.
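A quick programmatic version of that check (a sketch, assuming a matterport MaskRCNN instance named `model` that has already called `load_weights`): flag layers whose weights are NaN or have almost no spread.

```python
import numpy as np

# Sketch: flag suspicious ("dead") layers after model.load_weights(...).
# `model` is assumed to be an mrcnn.model.MaskRCNN instance.
for layer in model.keras_model.layers:
    for w in layer.get_weights():
        if np.isnan(w).any():
            print("NaN weights in layer:", layer.name)
        elif w.size > 1 and w.std() < 1e-7:
            print("Near-constant weights in layer:", layer.name)
```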
@tusharvora
The weights seem to load successfully even without the Keras loading. However, I suspect this has something to do with CUDA not being compatible with the 30-series GPUs: my machine has an RTX 3060 on CUDA 10, while the minimum these cards support is CUDA 11.
Did you solve this issue? I have an RTX 3080 and I'm using CUDA 10.0 for Mask R-CNN, and all losses come out as NaN.
Same here with an RTX 3070: all losses are NaN. Did you figure anything out?
I know this is an old thread, but I had the same issue and managed to get the model working on modern GPUs (RTX 3080 and RTX 4000) using conda, CUDA 11.8, and NVIDIA's own maintained build of TensorFlow 1.15. Here is my full conda env if you wish to run the model. Happy training!
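If you go that route, a quick sanity check (a sketch, assuming the TF 1.x-style API that NVIDIA's 1.15 build keeps) to confirm the GPU is actually visible before blaming the model:

```python
import tensorflow as tf

# Sketch: verify the TF build and that the GPU is visible (TF 1.x API).
print("TensorFlow version:", tf.__version__)
print("GPU available:", tf.test.is_gpu_available())
print("GPU device name:", tf.test.gpu_device_name())
```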
Hi,
I'm trying to train my dataset for the Data Science Bowl 2018 competition and I'm having trouble. The loss is always NaN no matter what I try. I know that my dataset is structured properly, as it looks like this:
So the problem isn't the data. Once I load it and try to run it, the model compiles, but the loss is consistently NaN and I can't figure out why. Can someone help?