Large Negative Loss + Large RAM Usage

Calcu-dev commented 3 years ago

Hi @anchitdharmw ,

I seem to be having problems with large RAM usage and large negative losses. I'll start by explaining the RAM usage.

RAM Usage: I have attempted to train on my GPU (NVIDIA GeForce 1060, 6 GB VRAM) and quickly run out of memory. I understand that Mask R-CNN is a large network and will most likely need more memory than this, so I can't complain there. However, when running on my CPU I see memory usage of 24 GB+. Even with only 2 workers and a mini-batch size of 2, memory usage consistently spikes to ~20 GB. My images are only 512x512, so I can't see why this network would take so much memory. This also prevents me from fully utilizing either my CPU or my GPU, since memory is the limiting factor.

Negative Losses: I know there was an issue (that I encountered as well) with the negative variance, and I have applied the workaround you suggested. Even so, I am getting negative losses that hover around -2,000. I haven't been able to finish an epoch of training (245 images) because of the large memory usage and the time it takes. If the negative loss could be rectified by letting training run for all 10 epochs, I'll do so.

For some context, here is some relevant information about my computer:
CPU: Intel Core i9-10850K (10 cores)
GPU: NVIDIA GeForce 1060 (6 GB VRAM)
RAM: 32 GB DDR4

Let me know if there is any more information you need from me.

Best, Adam

Calcu-dev commented 3 years ago

In addition, when running the example without changes I get the following error when trying to predict/display results:

Matrix dimensions must agree.

Error in fastRCNNObjectDetector.removeInvalidBoxesAndScores (line 1230)
            remove = remove | ~isfinite(scores);

Error in helper.filterBoxesAfterRegression (line 8)
    [bboxes, scores, labels] = fastRCNNObjectDetector.removeInvalidBoxesAndScores(bboxes, scores, labels);

Error in detectMaskRCNN (line 97)
[bboxes, scores, labels] = helper.filterBoxesAfterRegression(bboxes,scores,labels, imageSize);

Running the COCO dataset through your exact MATLAB code did not yield negative losses; however, I get this new error with both my dataset and the COCO dataset.
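For readers who hit the same message: the failing line in the trace combines two arrays element-wise with `|`, which MATLAB rejects when their sizes are incompatible. The sketch below reproduces that class of error with made-up data. It is only an illustration and does not claim to show what actually happens inside the detector; the variable names mirror the trace, and the values are hypothetical.

```matlab
% Hypothetical data: three boxes but only two scores (deliberate mismatch).
bboxes = [10 10 50 50; 20 20 40 40; 5 5 30 30];
scores = [0.9; 0.8];

% Row-wise invalidity mask for the boxes, 3-by-1.
remove = any(~isfinite(bboxes), 2);

% Compare the counts before combining the masks.
countsMatch = size(bboxes, 1) == numel(scores);
fprintf('boxes: %d, scores: %d, match: %d\n', size(bboxes, 1), numel(scores), countsMatch);

% The same expression as in the trace fails when the counts differ.
errorRaised = false;
try
    remove = remove | ~isfinite(scores);
catch err
    errorRaised = true;
    disp(err.message);
end
```

Checking the number of boxes, scores, and labels just before the filtering call is a cheap way to see which of the three arrays has the unexpected length.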

Best, Adam

kuldeep429 commented 3 years ago

I have also tried this code on my data and got a similar error: "Matrix dimensions must agree...". A solution would be very helpful.

Best, Kuldeep

anchitdharmw commented 3 years ago

Adam, Thanks for reporting the issues. I will look into this and get back to you soon!

anchitdharmw commented 3 years ago

> I have also tried this code on my data and got a similar error: "Matrix dimensions must agree...". A solution would be very helpful.
>
> Best, Kuldeep

Kuldeep, do you see this error message during prediction as well?

kuldeep429 commented 3 years ago

> > I have also tried this code on my data and got a similar error: "Matrix dimensions must agree...". A solution would be very helpful.
> >
> > Best, Kuldeep
>
> Kuldeep, do you see this error message during prediction as well?

No, it occurs during training. With the COCO dataset, both training and prediction ran without any error.

Calcu-dev commented 3 years ago

Hi @anchitdharmw ,

Any update on this? I know you're probably a busy person, but if you've had a chance to look into it and have anything to share, it would be greatly appreciated.

Best, Adam

anchitdharmw commented 3 years ago

Hi Adam,

Thanks for your patience. It looks like you are facing three issues here:

1. High memory usage.
2. High negative loss.
3. Error during inference.

High memory usage: I can see the high memory usage on my end as well and am currently investigating it. In the meantime, here are a few suggestions to reduce the memory usage:

Use a smaller number of proposals from the region proposal layer, around 750. You can change this by setting params.NumStrongestRegions = 750 in the createMaskConfig.m file.
Use a smaller ResNet-50 backbone. (I'll update the example with this support.)
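In createMaskConfig.m, the first suggestion amounts to a one-line change. This is a sketch based only on the field name given above; the rest of the file is omitted, and the surrounding structure of `params` is assumed.

```matlab
% In createMaskConfig.m (other fields omitted here).
% Keep only the 750 strongest region proposals, as suggested above.
params.NumStrongestRegions = 750;
```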

High negative loss: Could you provide more information about this once you are able to run the training?

Error during inference: I am able to reproduce this issue and will submit a fix.

I'll post here once I have more information on the memory usage. As for the ResNet-50 backbone, I'll update the example soon with this support.

Thanks.

anchitdharmw commented 3 years ago

Adam, a quick note about the "Error during inference" issue: it looks like I've already fixed this in the repo as part of another bug fix. Could you make sure that your copy of the repository is up to date? This might even resolve the high negative loss that you are seeing.

Thanks. -Anchit

Calcu-dev commented 3 years ago

Hi @anchitdharmw ,

Thanks for looking into these issues for me!

To answer your questions:

1.) I'll try reducing the number of region proposals and switching to the ResNet-50 backbone, and I'll edit this post with the results re: RAM usage.

2.) I'm able to train; however, I haven't let the training complete because the loss throughout the first epoch is usually around -2,000, which (I believe) isn't supposed to happen with a cross-entropy loss function. In addition, the loss is always positive in your example with the COCO dataset, so I'm wondering where this is coming from. I've checked my data formatting several times to ensure my .mat files are set up the same way as your COCO ones.

The only thing I've just noticed is that you have your masks set up as W x H x (number of masks) instead of H x W x (number of masks). Could this potentially be affecting it? And just for the sake of completeness, why is it flipped like that?

3.) I have the latest repo as of Nov. 23, 2020. I believe this is your last commit, and inference works neither on the example nor on my own dataset.
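Two of the points raised under 2.) can be checked on a single training sample before starting a long run. The sketch below is a hypothetical sanity check, not something suggested in the thread. It verifies that the mask stack matches the image in height and width (permuting it if the first two dimensions are swapped), and it uses a toy binary cross-entropy to show that targets outside [0, 1], such as masks stored as 0/255, can drive the loss negative. Whether either applies to this dataset is an assumption; `img`, `masks`, and `bce` are illustrative names, and the H x W x N target layout is assumed rather than taken from the repo.

```matlab
% Hypothetical sample: a 4-by-6 image with two instance masks stored W-by-H-by-N.
img   = zeros(4, 6, 3, 'uint8');
masks = false(6, 4, 2);
masks(2:3, 1:2, 1) = true;
masks(5:6, 3:4, 2) = true;

% Assumed target layout: H-by-W-by-N, matching the image.
imgHW = [size(img, 1), size(img, 2)];
if ~isequal([size(masks, 1), size(masks, 2)], imgHW)
    masks = permute(masks, [2 1 3]);   % swap the first two dimensions
end

% Toy binary cross-entropy over vectors of targets t and predictions p.
bce = @(t, p) -mean(t .* log(p) + (1 - t) .* log(1 - p));

p       = [0.9 0.2 0.7 0.1];      % made-up predicted probabilities
tBinary = [1 0 1 0];              % targets in {0, 1}
tScaled = 255 * tBinary;          % the same targets stored as 0/255

lossBinary = bce(tBinary, p);     % non-negative for targets in [0, 1]
lossScaled = bce(tScaled, p);     % negative for these particular values
fprintf('binary targets: %.3f, 0/255 targets: %.3f\n', lossBinary, lossScaled);
```

For square images such as 512x512, a size comparison cannot reveal a transposed mask, so overlaying one mask on its image is the more reliable check there.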

As always, I'm happy to share any more information that you need and I greatly appreciate you looking into these for me.

---------------------

EDIT: I switched to ResNet-50 and did see the RAM usage drop from 20+ GB to around 13-16 GB! It looks like the deeper network definitely had an effect on RAM usage.

If you want to include this in the repo as a possible suggestion to reduce RAM usage, I had to change the following:

createMaskRCNN line 8: change the network to resnet50.

createMaskRCNN line 12: rename 'data' to 'input_1' in the input layer arguments.

createMaskRCNN line 26: rename 'res5c_relu' to the last ReLU layer in resnet50 before the RPN, 'activation_49_relu'.
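The layer names listed above can be confirmed before editing createMaskRCNN. This sketch assumes the Deep Learning Toolbox and the ResNet-50 support package are installed; the names it checks are the ones reported in this thread and may differ between MATLAB releases.

```matlab
% Load the pretrained network (requires the ResNet-50 support package).
net = resnet50;
layerNames = {net.Layers.Name};

% Names reported in this thread for the input layer and the last ReLU.
wanted = {'input_1', 'activation_49_relu'};
found  = ismember(wanted, layerNames);
for k = 1:numel(wanted)
    fprintf('%-20s present: %d\n', wanted{k}, found(k));
end
```

If either name is reported as missing, `analyzeNetwork(net)` shows the full layer list so the equivalent layers can be located by hand.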

Best, Adam

anchitdharmw commented 3 years ago

Re: Inference error - a quick check, on Line 40 - before creation of the config object,

do you have classNames = [classNames {'Background'}]; or classNames = [classNames {'background'}]; ?

Calcu-dev commented 3 years ago

@anchitdharmw I have: classNames = [classNames {'Background'}]; in both cases.

anchitdharmw commented 3 years ago

@Calcu-dev, I just updated the repo with resnet50 support and fix for inference code.

Just looked at your resnet50 support changes and they are spot on.

anchitdharmw commented 3 years ago

@Calcu-dev, Re: training memory usage - We are looking into optimizing this further. Thanks for your help with debugging and patience.

Calcu-dev commented 3 years ago

@anchitdharmw ,

Thank you so much for your help thus far! The only lingering question is the large negative loss. I will ignore it for now and let the net train to completion to see what happens, but I doubt that will fix it. If you have any suggestions for things to try, I would greatly appreciate them.

As always, if you need anything from me to help with this process, let me know! I appreciate you taking the time and dealing with these issues.

Best, Adam

Djomana13 commented 3 years ago

@anchitdharmw, thank you for your efforts, sir. I am a PhD student working on change detection in satellite images, for which I would like to use the Mask R-CNN method. I started by implementing Faster R-CNN and obtained results, but adding the Mask R-CNN part has been a little difficult for me. I would like to ask how to prepare my dataset for network training, as indicated in the MATLAB help page "Getting Started with Mask R-CNN for Instance Segmentation". I would just like the code showing how to present the data for Mask R-CNN training.

Thank you for your help, sir.

matlab-deep-learning / mask-rcnn

Large Negative Loss + Large RAM Usage #3